PVE 4.1 to 4.2 upgrade: Corrupted CEPH VM images

wosp · Apr 28, 2016

On a small 3-node cluster, running 16 VM's, I did a upgrade from 4.1-22 to 4.2-2. Each node also has one CEPH OSD onboard. Cluster (Proxmox VE and CEPH) was completly healthy before the upgrade started. I did the upgrade node-by-node and before I finished a node and started with the following node I waited until CEPH was "HEALTH_OK" again. Before the first upgrade started all VM's where moved from node01 and node02 to node03. I upgraded node01 and node02, moved the VM's from node03 to node01 and node02 and upgraded node03. Then I moved the VM's for node03 back to node03. I was logging in to ALL VM's to test if everything was ok but on 2 of the VM's I got filesystem errors. When I rebooted them these VM's can't boot at all, fsck won't help: the CEPH VM images of those 2 VM's are corrupted (input/output errors). So I have to restore those VM's from backup. Ceph was already running 0.94.6, so no version change. But kernel is upgraded offcourse.

Ceph config:

Code:

[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  auth supported = cephx
  cluster network = 192.168.121.0/24
  filestore xattr use omap = true
  fsid = 2f5c1777-0fe5-483b-9f6a-c7d4a874c62a
  keyring = /etc/pve/priv/$cluster.$name.keyring
  osd journal size = 5120
  osd pool default min size = 1
  public network = 192.168.111.0/24

[osd]
  keyring = /var/lib/ceph/osd/ceph-$id/keyring
  osd max backfills = 1
  osd recovery max active = 1

[mon.1]
  host = node02
  mon addr = 192.168.111.131:6789

[mon.0]
  host = node01
  mon addr = 192.168.111.130:6789

[mon.2]
  host = node03
  mon addr = 192.168.111.132:6789

Ceph crush:

# begin crush map tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 tunable chooseleaf_descend_once 1 tunable straw_calc_version 1 # devices device 0 osd.0 device 1 osd.1 device 2 osd.2 # types type 0 osd type 1 host type 2 chassis type 3 rack type 4 row type 5 pdu type 6 pod type 7 room type 8 datacenter type 9 region type 10 root # buckets host node01 { id -2 # do not change unnecessarily # weight 0.360 alg straw hash 0 # rjenkins1 item osd.0 weight 0.360 } host node02 { id -3 # do not change unnecessarily # weight 0.360 alg straw hash 0 # rjenkins1 item osd.1 weight 0.360 } host node03 { id -4 # do not change unnecessarily # weight 0.360 alg straw hash 0 # rjenkins1 item osd.2 weight 0.360 } root default { id -1 # do not change unnecessarily # weight 1.080 alg straw hash 0 # rjenkins1 item node01 weight 0.360 item node02 weight 0.360 item node03 weight 0.360 } # rules rule replicated_ruleset { ruleset 0 type replicated min_size 1 max_size 10 step take default step chooseleaf firstn 0 type host step emit } # end crush map

dietmar · Apr 28, 2016

wosp said:
Ceph was already running 0.94.6, so no version change. But kernel is upgraded offcourse.

We use the ceph user space library to access rdb images, so the kernel is out of scope here.

wosp · Apr 28, 2016

Okay, any other suggestions where to look for this problem? In logfiles nothing is found.
Tonight i'm going to move all VM's and reboot each node again, just to test what happens. Something special to watch for?

dietmar · Apr 28, 2016

Did you update any ceph packages? I would also test if VMs boot cleanly before you run the update.

wosp · Apr 28, 2016

No, this are the packages that were updated (maybe Ceph relies on one of these, but don't know that):

Code:

iproute2: 4.2.0-2 ==> 4.4.0-1
libelf1: 0.159-4.2 (new)
libnvpair1: 0.6.5-pve7~jessie ==> 0.6.5-pve9~jessie
libpve-access-control: 4.0-13 ==> 4.0-16
libpve-common-perl: 4.0-54 ==> 4.0-59
libpve-storage-perl: 4.0-45 ==> 4.0-50
libuutil1: 0.6.5-pve7~jessie ==> 0.6.5-pve9~jessie
libzfs2: 0.6.5-pve7~jessie ==> 0.6.5-pve9~jessie
libzpool2: 0.6.5-pve7~jessie ==> 0.6.5-pve9~jessie
novnc-pve: 0.5-5 ==> 0.5-6
proxmox-ve: 4.1-41 ==> 4.2-48
pve-cluster: 4.0-36 ==> 4.0-39
pve-container: 1.0-52 ==> 1.0-62
pve-firewall: 2.0-22 ==> 2.0-25
pve-firmware: 1.1-7 ==> 1.1-8
pve-ha-manager: 1.0-25 ==> 1.0-28
pve-kernel-4.4.6-1-pve: 4.4.6-48 (new)
pve-manager: 4.1-22 ==> 4.2-2
pve-qemu-kvm: 2.5-9 ==> 2.5-14
qemu-server: 4.0-64 ==> 4.0-72
spl: 0.6.5-pve3~jessie ==> 0.6.5-pve5~jessie
tar: 1.27.1-2+b1 ==> 1.27.1+pve.3
zfs-initramfs: 0.6.5-pve7~jessie ==> 0.6.5-pve9~jessie
zfsutils: 0.6.5-pve7~jessie ==> 0.6.5-pve9~jessie

I just did some more digging and thinking about this and some facts are possibly worth mentioning, although it may also that it is not related at all:

1. The node running the corrupted VM's (node02) was also the node that was fencing when I disabled hpet in grub config. See https://forum.proxmox.com/threads/hpet-and-watchdog.26962/

I still have this errors in logs on this node (and all other nodes, but on different times), but the node doesn't fence unnecessary anymore as long as hpet is enabled:

Code:

Apr 28 00:31:38 node02 kernel: [15851.110851] CE: hpet increased min_delta_ns to 20115 nsec
Apr 28 08:23:02 node02 kernel: [44135.767139] CE: hpet4 increased min_delta_ns to 20115 nsec
Apr 28 08:25:47 node02 kernel: [44300.897158] CE: hpet2 increased min_delta_ns to 20115 nsec

2. This cluster earlier had some problems with systemd-timesyncd + CEPH (see https://forum.proxmox.com/threads/pve-4-1-systemd-timesyncd-and-ceph-clock-skew.27043/). However, since I switched to ntp, no clock skew detected anymore.

3. On the console the following errors showed up on this node02:

Code:

Apr 28 00:11:21 node02 kernel: [14634.400135] kvm [23736]: vcpu0 unhandled rdmsr: 0x570
Apr 28 00:14:39 node02 kernel: [14831.980793] kvm [23736]: vcpu0 unhandled rdmsr: 0x570
Apr 28 00:23:46 node02 kernel: [15379.306083] kvm [23736]: vcpu0 unhandled rdmsr: 0x570
Apr 28 00:34:23 node02 kernel: [16016.003740] kvm [29151]: vcpu0 unhandled rdmsr: 0x570

I don't have these errors on any other node (they are written tot kern.log).

wosp · Apr 28, 2016

Update: I just did a double reboot of each node and migrated the VM's the same way I did yesterday. After all the nodes were rebooted I migrated them back to the node where they belong. Everything was OK, to be sure I stop-start every VM one-by-one, also no problems. Strange, but can't reproduce this problem.

wosp · May 12, 2016

Today I rebooted the nodes (and updated to Ceph 0.94.7) and the same issue occurs, again 2 of the CEPH VM images are corrupted (input/output errors), but not the same VM's as last time, only thing they have in common is they all are Elastix 2.x (CentOS 5.x with 2.6.x kernel) VM's. I thought it was just a one-time issue, but doesn't seem that way. Any ideas would be helpful.

wosp · May 14, 2016

Update: I did some more testing and don't know for sure, but seems that this has something to do with hpet. After I had disabled systemd-timesyncd (https://forum.proxmox.com/threads/pve-4-1-systemd-timesyncd-and-ceph-clock-skew.27043/), installed ntp and added "hpet=disable" (hpet=disable with systemd-timesyncd makes the system unstable) to grub config, the problem hasn't come back, while I tried to get it back really hard. Time will tell.

wosp · May 18, 2016

Unfortunately this problem isn't solved. I just again had the same issue with 2 VM disk images. But it came to my attention now that when I moved this 2 VM's not a single ping was lost. Normally with an online migration you lose 1-3 packets for each VM, so it seems to be an issue with the transfer of the VM from one node to another. This was never a problem and this cluster was running fine with 3.x and 4.1 before. Problem is there since upgrade to 4.2.

spirit · May 19, 2016

>>(CentOS 5.x with 2.6.x kernel) VM's.

do you use virtio disk ?

cache=none or cache=writeback ?

wosp · May 19, 2016

Hi Alexandre,

Yes, I always use virtio as disk driver and cache=none.
Thanks.

spirit · May 19, 2016

wosp said:
Hi Alexandre,

Yes, I always use virtio as disk driver and cache=none.
Thanks.

Do you have tested with ide ? (as it's a voip vm, I don't think disk performance need to be high ?)

virtio-blk drivers are pretty old (introduced in kernel =2.6.25, but stable only since 2.6.32), I think redhat/centos have backported them in the kernel, but I don't known which version.

wosp · May 19, 2016

spirit said:
Do you have tested with ide ? (as it's a voip vm, I don't think disk performance need to be high ?)

virtio-blk drivers are pretty old (introduced in kernel =2.6.25, but stable only since 2.6.32), I think redhat/centos have backported them in the kernel, but I don't known which version.

Last year I've tested on PVE 3.x with IDE as driver (while testing some performance differences with different drivers), but that maked the Elastix VM boot very slow (+/- 5 mins, with virtio +/- 1 mins), so don't know if that's a good option. But isn't it strange I didn't have this problem on PVE 3.x (for many months!) and PVE 4.1 (some weeks)? If it was in the VM (because it doesn't handle virtio driver good, for example), this problem should exist on older PVE releases to. Right?

Besides that, the VM image itself is corrupted and even a fsck with live CD doesn't help (input/output-error, I really need to restore from backup), so to me it seems to be a problem outside the VM.

When I first transfer the VM to another node it's OK and when that node is back from it's reboot and after some mins (waiting for ceph to be "HEALTH_OK") I transfer it back, sometimes a VM image is corrupted, but (untill now) never on the first transfer (before te reboot).

As said, one thing I noticed was when this occurs not a single ping was lost with the online migration. Looks like the transfer isn't completed well (although status is OK and VM is moved in GUI). When this occurs and I open the Web GUI most of the times I can login to Elastix, but after the first page is loaded (after logging in) the VM is crashed with lots of IO-errors on console. So it works just a couple of secs, then it's gone.

spirit · May 19, 2016

When I first transfer the VM to another node it's OK and when that node is back from it's reboot and after some mins (waiting for ceph to be "HEALTH_OK") I transfer it back, sometimes a VM image is corrupted, but (untill now) never on the first transfer (before te reboot).

do you have problem if you do live migrations a lot of time, without restart node (and ceph daemon) ?

wosp · May 19, 2016

spirit said:
do you have problem if you do live migrations a lot of time, without restart node (and ceph daemon) ?

First of all, thank you for all your help, greatly appreciated!

After some more testing today I have a workaround and I know where the problem is.
To answer your question: yes, I can confirm the problem also exists without a node restart or ceph daemon reload.

Normally I always migrate 2-3 VM's at the same time. Since I was testing today to test a lot of live migrations (without the restart) and to hurry up a little I was migrating 6 VM's at the same time. First migrated all VM's to node03, after that migrated all to node01, then node02 and last to node03 again (and remember: 6 VM's at the same time). After this migrations I checked all VM's and guess what...I had 6 corrupted images! Normally I always have 2-3 corrupted images.

Did some more testing and did a lot of live migrations of all VM's, but just one at the same time. Not a single corrupted image! To be 100% sure I did some more live migrations, but now with 3 VM's at the same time....and yes, after I've migrated all VM's 2 times, I had 3 VM's with corrupted images.

So, my workaround is simply to live migrate only 1 VM at the same time. As I already mentoined this wasn't a problem on PVE 4.1 (not applicable to PVE 3.x, since it was not possible to migrate more then 1 VM at the same time at all with PVE 3.x), so seems to be a bug in PVE 4.2. However, I only have corrupted images of Elastix VM's (10 of 16 VM's running on this cluster). What do you think, should I open a bugreport for this, or do you have any other suggestions?

Thank you!

Edit:
Almost forgot. Attached you will find an example of console output when a image is corrupted.

spirit · May 20, 2016

This is really strange,

differents migration process are really independent.

Maybe do you sature your bandwith with doing multiple migration process at the same time ?

can you post the migrations task log ?

do you see anything special in ceph logs during the vm migrations ?

wosp · May 20, 2016

spirit said:
differents migration process are really independent.

Are you sure about that? I doubt that since when I migrate (2) VM's at the same time (on all of my clusters, even on my 10 Gbit cluster) sometimes I see a "migration aborted" for one of the VM's. Looks like if 2 processes are exactly alike/meet each other somewhere, something fails. If I restart the transfer then it's OK (and the other live migration (that was already running) is finishing fine). Not really a problem, but this fact let me think it are not standalone/independent processes.

spirit said:
Maybe do you sature your bandwith with doing multiple migration process at the same time ?

I checked bandwith usage during migration, but there is not much traffic going while migrations are running (highest seen was a (almost) 400 Mbit peak on Gigabit link) when I live migrate. Since I use shared storage (over a dedicated LAN) only thing that needs to be transferred is the conf file and the memory dump (I think). The Elastix VM's only have 1-2 GB memory and 15 GB SSD each, while others (non Elastix VM's) even with 10+ GB memory and 50GB SSD each have no problems running live migration together. Also the network didn't changed when I upgraded to PVE 4.2.

spirit said:
can you post the migrations task log ?

All are the same, nothing special, no errors. Example:

Code:

task started by HA resource agent
May 19 21:08:28 starting migration of VM 1012 to node 'node02' (192.168.111.131)
May 19 21:08:28 copying disk images
May 19 21:08:28 starting VM 1012 on remote node 'node02'
May 19 21:08:32 starting ssh migration tunnel
May 19 21:08:32 starting online/live migration on localhost:60000
May 19 21:08:32 migrate_set_speed: 8589934592
May 19 21:08:32 migrate_set_downtime: 0.1
May 19 21:08:34 migration status: active (transferred 244466019, remaining 755748864), total 1082990592)
May 19 21:08:34 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
May 19 21:08:36 migration status: active (transferred 475535744, remaining 514940928), total 1082990592)
May 19 21:08:36 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
May 19 21:08:38 migration status: active (transferred 709747342, remaining 266219520), total 1082990592)
May 19 21:08:38 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
May 19 21:08:40 migration speed: 128.00 MB/s - downtime 57 ms
May 19 21:08:40 migration status: completed
May 19 21:08:44 migration finished successfully (duration 00:00:16)
TASK OK

And when I have a "migration aborted" like I mentioned above (this is from logs on PVE 4.1 cluster with no other issues):

Code:

task started by HA resource agent
May 19 13:22:03 starting migration of VM 100 to node 'host05' (192.168.110.138)
May 19 13:22:03 copying disk images
May 19 13:22:03 starting VM 100 on remote node 'host05'
May 19 13:22:03 trying to acquire lock... OK
May 19 13:22:05 starting ssh migration tunnel
May 19 13:22:06 starting online/live migration on localhost:60000
May 19 13:22:06 migrate_set_speed: 8589934592
May 19 13:22:06 migrate_set_downtime: 0.1
May 19 13:22:08 ERROR: online migrate failure - aborting
May 19 13:22:08 aborting phase 2 - cleanup resources
May 19 13:22:08 migrate_cancel
May 19 13:22:09 ERROR: migration finished with problems (duration 00:00:06)
TASK ERROR: migration problems

spirit said:
do you see anything special in ceph logs during the vm migrations ?

I have some deep-scrubs while migrations were running yesterday (but don't have them on 18-05-2016, when I also had 2 corrupted images with live migrations), thats all "special" that can be found in logs. Also see deep-scrubs in logfiles on other days when I was not migrating (4AM for example), according to http://ceph.com/planet/deep-scrub-distribution/ this is normal once a week for each pg (and I have 300 pgs, so not very special I think, but here is the log):

wosp · May 20, 2016

Code:

2016-05-19 20:23:15.298146 osd.1 192.168.111.131:6800/4120 183 : cluster [INF] 1.a9 scrub starts
2016-05-19 20:23:15.429815 osd.1 192.168.111.131:6800/4120 184 : cluster [INF] 1.a9 scrub ok
2016-05-19 20:38:42.435549 osd.1 192.168.111.131:6800/4120 185 : cluster [INF] 1.0 scrub starts
2016-05-19 20:38:42.470497 osd.1 192.168.111.131:6800/4120 186 : cluster [INF] 1.0 scrub ok
2016-05-19 20:42:51.939464 osd.0 192.168.111.130:6800/4459 877 : cluster [INF] 1.b5 scrub starts
2016-05-19 20:42:52.012392 osd.0 192.168.111.130:6800/4459 878 : cluster [INF] 1.b5 scrub ok
2016-05-19 20:54:35.107119 osd.0 192.168.111.130:6800/4459 879 : cluster [INF] 1.ba scrub starts
2016-05-19 20:54:35.181607 osd.0 192.168.111.130:6800/4459 880 : cluster [INF] 1.ba scrub ok
2016-05-19 20:54:52.650121 osd.1 192.168.111.131:6800/4120 187 : cluster [INF] 1.66 scrub starts
2016-05-19 20:54:52.754438 osd.1 192.168.111.131:6800/4120 188 : cluster [INF] 1.66 scrub ok
2016-05-19 21:00:00.000194 mon.0 192.168.111.130:6789/0 235153 : cluster [INF] HEALTH_OK
2016-05-19 21:03:47.839786 mon.0 192.168.111.130:6789/0 235248 : cluster [INF] pgmap v11730550: 300 pgs: 1 active+clean+scrubbing+deep, 299 active+clean; 146 GB data, 285 GB used, 814 GB / 1100 GB avail; 132 kB/s wr, 32 op/s
2016-05-19 21:03:46.735274 osd.2 192.168.111.132:6800/4156 934 : cluster [INF] 1.a3 deep-scrub starts
2016-05-19 21:03:48.876050 osd.2 192.168.111.132:6800/4156 935 : cluster [INF] 1.a3 deep-scrub ok
2016-05-19 21:03:51.556145 mon.0 192.168.111.130:6789/0 235249 : cluster [INF] pgmap v11730551: 300 pgs: 1 active+clean+scrubbing+deep, 299 active+clean; 146 GB data, 285 GB used, 814 GB / 1100 GB avail; 140 kB/s wr, 31 op/s
2016-05-19 21:03:52.561852 mon.0 192.168.111.130:6789/0 235250 : cluster [INF] pgmap v11730552: 300 pgs: 1 active+clean+scrubbing+deep, 299 active+clean; 146 GB data, 285 GB used, 814 GB / 1100 GB avail; 105 kB/s wr, 20 op/s
2016-05-19 21:03:51.025432 osd.2 192.168.111.132:6800/4156 936 : cluster [INF] 1.5b deep-scrub starts
2016-05-19 21:03:52.978709 osd.2 192.168.111.132:6800/4156 937 : cluster [INF] 1.5b deep-scrub ok
2016-05-19 21:03:53.025976 osd.2 192.168.111.132:6800/4156 938 : cluster [INF] 1.9e scrub starts
2016-05-19 21:03:53.141270 osd.2 192.168.111.132:6800/4156 939 : cluster [INF] 1.9e scrub ok
2016-05-19 21:03:54.026287 osd.2 192.168.111.132:6800/4156 940 : cluster [INF] 1.2c scrub starts
2016-05-19 21:03:54.119670 osd.2 192.168.111.132:6800/4156 941 : cluster [INF] 1.2c scrub ok
2016-05-19 21:03:58.027738 osd.2 192.168.111.132:6800/4156 942 : cluster [INF] 1.47 scrub starts
2016-05-19 21:03:58.150930 osd.2 192.168.111.132:6800/4156 943 : cluster [INF] 1.47 scrub ok
2016-05-19 21:04:00.028424 osd.2 192.168.111.132:6800/4156 944 : cluster [INF] 1.35 scrub starts
2016-05-19 21:04:00.137005 osd.2 192.168.111.132:6800/4156 945 : cluster [INF] 1.35 scrub ok
2016-05-19 21:04:52.042130 osd.2 192.168.111.132:6800/4156 946 : cluster [INF] 1.d3 scrub starts
2016-05-19 21:04:52.152387 osd.2 192.168.111.132:6800/4156 947 : cluster [INF] 1.d3 scrub ok
2016-05-19 21:06:49.010111 mon.0 192.168.111.130:6789/0 235353 : cluster [INF] pgmap v11730655: 300 pgs: 1 active+clean+scrubbing+deep, 299 active+clean; 146 GB data, 285 GB used, 814 GB / 1100 GB avail; 339 B/s rd, 136 kB/s wr, 44 op/s
2016-05-19 21:06:46.074856 osd.2 192.168.111.132:6800/4156 948 : cluster [INF] 1.bf deep-scrub starts
2016-05-19 21:06:48.552606 osd.2 192.168.111.132:6800/4156 949 : cluster [INF] 1.bf deep-scrub ok
2016-05-19 21:06:51.029914 mon.0 192.168.111.130:6789/0 235354 : cluster [INF] pgmap v11730656: 300 pgs: 1 active+clean+scrubbing+deep, 299 active+clean; 146 GB data, 285 GB used, 814 GB / 1100 GB avail; 338 B/s rd, 212 kB/s wr, 61 op/s
2016-05-19 21:06:53.042616 mon.0 192.168.111.130:6789/0 235355 : cluster [INF] pgmap v11730657: 300 pgs: 1 active+clean+scrubbing+deep, 299 active+clean; 146 GB data, 285 GB used, 814 GB / 1100 GB avail; 198 kB/s wr, 41 op/s
2016-05-19 21:13:30.380701 osd.2 192.168.111.132:6800/4156 950 : cluster [INF] 1.15 scrub starts
2016-05-19 21:13:30.424855 osd.2 192.168.111.132:6800/4156 951 : cluster [INF] 1.15 scrub ok
2016-05-19 21:24:34.481076 osd.0 192.168.111.130:6800/4459 881 : cluster [INF] 1.4 scrub starts
2016-05-19 21:24:34.545135 osd.0 192.168.111.130:6800/4459 882 : cluster [INF] 1.4 scrub ok
2016-05-19 21:29:01.082923 osd.1 192.168.111.131:6800/4120 189 : cluster [INF] 1.3c scrub starts
2016-05-19 21:29:01.199216 osd.1 192.168.111.131:6800/4120 190 : cluster [INF] 1.3c scrub ok

spirit · May 20, 2016

I doubt that since when I migrate (2) VM's at the same time (on all of my clusters, even on my 10 Gbit cluster) sometimes I see a "migration aborted" for one of the VM's

That is really strange.

The process migration is:
start a new vm on target destination (with same config) but in pause mode
establish a ssh tunnel
copy the memory from source vm to target vm
stop source vm
resume the target vm

maybe can you try to disable ssh tunnel
/etc/pve/datacenter.cfg
migration_unsecure: 1

(it should improve speed transfert, as ssh tunnel is cpu bound around 500mbits)

for the "
ERROR: online migrate failure - aborting", generaly is that qemu target process is crashing.
But we can't get the log in proxmox.

if you want to debug,
launch the migration process
on target host, look a the "kvm -id ..." , note the command line and migration port at the end
stop the migration process

then to do the live migration manually:

on target host:
"PVE_MIGRATED_FROM=sourcehostname kvm -id ....."

then, in gui, in vmsource monitor:
"migrate tcp:ipoftargethost

ort"

wosp · May 21, 2016

Okay, what I did today:

1. Upgrade to PVE 4.2-5/7cf09667 and rebooted all nodes. Just to be sure this isn't solved already in a upgraded version of PVE.
2. After rebootes finished I've migrated some VM's, same issues. So I'm sure the problem is still there and not solved in new version.
3. Added "migration_unsecure: 1" to /etc/pve/datacenter.cfg. Did some migrations and all went fine. I can even online migrate all (16) VM's at the same time without one "migration aborted" or a corrupted VM image. Moved all VM's at the same time from node01 to node02, to node03 and back to node01. Not a single issue!

4. I did remember I've added some config ("ClientAliveInterval 60" and "ClientAliveCountMax 3") to /etc/ssh/sshd_config in the past. So I commented them out (so it's factory default again), restarted SSHd on alle nodes and removed "migration_unsecure: 1" from /etc/pve/datacenter.cfg. After that I tested again, but issues came back. So it's not related to this settings. Re-added them to /etc/ssh/sshd_config and restarted SSHd on all nodes.
5. Again added "migration_unsecure: 1" to /etc/pve/datacenter.cfg. Now I tested live migrations again. I moved all VM's from node01 to node02, to node03, 3 times. So with 16 VM's running on this cluster I have 16 (VM's) x 3 (nodes) x 3 (migrations) + 16 (final migration back to node01) = 160 live migrations. Not a single issue (no migration aborted and no disk image corrupted).

And, now I can also use the "migrate all" function from GUI, which was not working when you are logged in as PVE user (only working as root normally), see https://forum.proxmox.com/threads/proxmox-3-4-migrate-all-vm-s-ha.21127/.

So, adding "migration_unsecure: 1" to /etc/pve/datacenter.cfg solves all my issues. Still very strange why the images are corrupted since PVE 4.2 upgrade and not before, but at least I have a solution now

. Also nice it solved some other (minor) issues I've had (but also exists in pre 4.2 versions). Thanks again for all your help!

PVE 4.1 to 4.2 upgrade: Corrupted CEPH VM images

Renowned Member

Proxmox Staff Member

Renowned Member

Proxmox Staff Member

Renowned Member

Renowned Member

Renowned Member

Renowned Member

Renowned Member

Distinguished Member

Renowned Member

Distinguished Member

Renowned Member

Distinguished Member

Renowned Member

Attachments

Distinguished Member

Renowned Member

Renowned Member

Distinguished Member

Renowned Member

We value your privacy