Converted/migrated servers do not reboot

Romsch
Well-Known Member · Feb 14, 2019 · Erlangen, Germany
Hello!
We have a three-node cluster; the storage for the VMs is Ceph. I have migrated a lot of physical servers to PVE with Clonezilla and have also converted roughly 15 VMware VMs to PVE, in the past without issues.
Now we have had a problem (the third problem/server after a while): an Ubuntu 18.04 VM hung up with a kernel panic and won't reboot - no bootable disk. I tried a lot of different ISO images to boot the VM and "enter" the disk, but the whole disk is empty.
One week later, another VM with SLES 12 SP3 hung up. It was not reachable, no ping; the VM looked like it was frozen. OK, I wanted to reboot the VM, but got the same error during boot, no bootable device... I also tried to boot with a live ISO... no chance, only restoring the VM from backup helped.
Today the same issue: it looks like the VM hangs or is frozen, and stop and start do not work. No bootable device; the restore is running now.

Does anyone have the same issues with VMs on Ceph? I never had this in the past, and it is not clear why the VMs lose all data on their Ceph disk after a reboot following such a hang.

Thanks for any help!

proxmox-ve: 5.4-2 (running kernel: 4.15.18-18-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-6
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-17-pve: 4.15.18-43
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-10-pve: 4.15.18-32
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
 
Do you still have this problem? Just to be sure: this only affects VMs that hung up, and rebooting other VMs works normally?
 
Hello dominic,

thanks for your reply.
I guess - and someone else in the forum has nearly the same problem - that it is/was a problem during the backup job, but I am not sure.
When the backup job has all nodes selected, the problems appear on some VMs on the console:
[screenshot]
The screenshots of the consoles:
[screenshot]

[screenshot]


And I changed the backup job so that the backup server backs up one node after the other - then it was mostly without issues.
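For reference, a staggered per-node schedule in /etc/pve/vzdump.cron could look roughly like the sketch below (one job per node, started an hour apart; the storage name "backup" and the times are just placeholders, not our real configuration):
Code:
# /etc/pve/vzdump.cron - cluster-wide vzdump schedule, one job per node
0 22 * * * root vzdump --all 1 --node pve1 --mode snapshot --quiet 1 --storage backup
0 23 * * * root vzdump --all 1 --node pve2 --mode snapshot --quiet 1 --storage backup
0 0  * * * root vzdump --all 1 --node pve3 --mode snapshot --quiet 1 --storage backup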

But today the same error on two VMs :) I really don't know why - see the screenshots of the two VMs above.

This is a screenshot of the PVE node that the two VMs are running on (pve3):
[screenshot]

On pve2 the same messages; on pve1 there is nothing:

[screenshot]

pve1:

[screenshot]

And the swap is really... bad. On all three nodes swap is in use (standard PVE installation) and very often "red". When I manually move some VMs from one node to another, the swap indicator mostly goes from red back to orange or blue. We have enough RAM, so maybe the swap, or something else, is the problem?
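One thing I might try is lowering the swappiness so the kernel prefers reclaiming cache over swapping - just an idea, not a confirmed fix (the default value is 60):
Code:
# show the current value
sysctl vm.swappiness
# lower it at runtime
sysctl -w vm.swappiness=10
# make it persistent across reboots
echo "vm.swappiness = 10" >> /etc/sysctl.conf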

Best regards,

Roman
 

What hardware is your cluster running on? What is the output of the following commands?
Code:
ceph status
ceph osd df
lsblk
 
Hi dominic,

thanks for your reply!

The hardware (3x) is a Supermicro X11DPi-NT (LGA3647, Intel C622, E-ATX) with an Intel Xeon 3106 and 128 GB DDR4 RAM. Proxmox is installed on a MegaRAID 9361-4i with two 240 GB SATA SSDs (Seagate Nytro 1351).
Ceph has its own physically separate 10 Gbit network, and the PVE cluster network is also 10 Gbit.

Here is the output of the commands:

ceph status:

cluster:
id: 4938cac1-e5d4-4c53-8581-2b8664f16361
health: HEALTH_OK

services:
mon: 3 daemons, quorum pve1,pve2,pve3
mgr: pve2(active), standbys: pve3
osd: 21 osds: 21 up, 21 in

data:
pools: 1 pools, 1024 pgs
objects: 1.07M objects, 4.05TiB
usage: 12.1TiB used, 7.04TiB / 19.1TiB avail
pgs: 1024 active+clean

io:
client: 7.58KiB/s rd, 1.84MiB/s wr, 1op/s rd, 67op/s wr



ceph osd df:

ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
0 ssd 0.90959 1.00000 931GiB 568GiB 364GiB 60.96 0.97 141
3 ssd 0.90959 1.00000 931GiB 568GiB 363GiB 61.00 0.97 141
6 ssd 0.90959 1.00000 931GiB 577GiB 354GiB 61.94 0.98 144
9 ssd 0.90959 1.00000 931GiB 572GiB 359GiB 61.44 0.97 143
14 ssd 0.90959 1.00000 931GiB 595GiB 337GiB 63.86 1.01 148
15 ssd 0.90959 1.00000 931GiB 607GiB 325GiB 65.16 1.03 151
18 ssd 0.90959 1.00000 931GiB 630GiB 301GiB 67.67 1.07 156
1 ssd 0.90959 1.00000 931GiB 577GiB 354GiB 61.95 0.98 144
4 ssd 0.90959 1.00000 931GiB 561GiB 370GiB 60.25 0.95 139
7 ssd 0.90959 1.00000 931GiB 654GiB 277GiB 70.24 1.11 163
10 ssd 0.90959 1.00000 931GiB 540GiB 391GiB 57.99 0.92 133
12 ssd 0.90959 1.00000 931GiB 625GiB 306GiB 67.16 1.06 156
16 ssd 0.90959 1.00000 931GiB 596GiB 335GiB 63.99 1.01 149
19 ssd 0.90959 1.00000 931GiB 563GiB 369GiB 60.41 0.96 140
2 ssd 0.90959 1.00000 931GiB 615GiB 316GiB 66.02 1.05 153
5 ssd 0.90959 1.00000 931GiB 622GiB 309GiB 66.78 1.06 155
8 ssd 0.90959 1.00000 931GiB 543GiB 388GiB 58.30 0.92 135
11 ssd 0.90959 1.00000 931GiB 514GiB 417GiB 55.22 0.87 128
13 ssd 0.90959 1.00000 931GiB 599GiB 333GiB 64.29 1.02 149
17 ssd 0.90959 1.00000 931GiB 595GiB 336GiB 63.89 1.01 148
20 ssd 0.90959 1.00000 931GiB 629GiB 302GiB 67.55 1.07 156
TOTAL 19.1TiB 12.1TiB 7.04TiB 63.15
MIN/MAX VAR: 0.87/1.11 STDDEV: 3.64




lsblk:


NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 222.6G 0 disk
├─sda1 8:1 0 1007K 0 part
├─sda2 8:2 0 512M 0 part /boot/efi
└─sda3 8:3 0 222.1G 0 part
├─pve-swap 253:0 0 8G 0 lvm [SWAP]
├─pve-root 253:1 0 55.5G 0 lvm /
├─pve-data_tmeta 253:2 0 1.4G 0 lvm
│ └─pve-data-tpool 253:4 0 139.7G 0 lvm
│ └─pve-data 253:5 0 139.7G 0 lvm
└─pve-data_tdata 253:3 0 139.7G 0 lvm
└─pve-data-tpool 253:4 0 139.7G 0 lvm
└─pve-data 253:5 0 139.7G 0 lvm
sdb 8:16 0 931.5G 0 disk
├─sdb1 8:17 0 100M 0 part /var/lib/ceph/osd/ceph-14
└─sdb2 8:18 0 931.4G 0 part
sdc 8:32 0 931.5G 0 disk
├─sdc1 8:33 0 100M 0 part /var/lib/ceph/osd/ceph-0
└─sdc2 8:34 0 931.4G 0 part
sdd 8:48 0 931.5G 0 disk
├─sdd1 8:49 0 100M 0 part /var/lib/ceph/osd/ceph-3
└─sdd2 8:50 0 931.4G 0 part
sde 8:64 0 931.5G 0 disk
├─sde1 8:65 0 100M 0 part /var/lib/ceph/osd/ceph-6
└─sde2 8:66 0 931.4G 0 part
sdf 8:80 0 931.5G 0 disk
├─sdf1 8:81 0 100M 0 part /var/lib/ceph/osd/ceph-9
└─sdf2 8:82 0 931.4G 0 part
sdg 8:96 0 931.5G 0 disk
├─sdg1 8:97 0 100M 0 part /var/lib/ceph/osd/ceph-15
└─sdg2 8:98 0 931.4G 0 part
sdh 8:112 0 931.5G 0 disk
├─sdh1 8:113 0 100M 0 part /var/lib/ceph/osd/ceph-18
└─sdh2 8:114 0 931.4G 0 part



Best regards, roman
 
Thank you!

What storage controller are your VMs using? SATA, IDE, SCSI or VirtIO? This can be seen with
Code:
cat /etc/pve/nodes/YOUR_NODE/qemu-server/VM_ID.conf
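For example, a VM using VirtIO SCSI would contain lines like these (the storage name and disk size here are only an illustration, not taken from your cluster):
Code:
bootdisk: scsi0
scsihw: virtio-scsi-pci
scsi0: ceph_vm:vm-101-disk-0,size=32G
whereas a VM converted from VMware would typically show "scsihw: lsi" together with an ide0 or sata0 disk line.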



running on MegaRAID 9361-4i with
Using a RAID controller in combination with Ceph may well be the source of your problems (see here). You could try to use an HBA instead.
 
Hi Dominic,
thx for your reply!

Please don't misunderstand me, I use Ceph without RAID! The MegaRAID 9361-4i is only for the Proxmox OS installation!
The seven SSDs are connected directly to the SATA ports without RAID. Ceph uses the 7 SSDs per node directly; for the whole cluster, 21 SSDs are available/configured.

Of the 22 running VMs, 12 use the VirtIO SCSI controller with VirtIO hard disks (Windows and Linux, created/restored on PVE). The other 10 VMs were migrated or rather converted from VMware; for these I have to use the standard LSI controller and mostly IDE, and some VMs have SATA hard disks for compatibility. Some converted VMs (the older ones like SLES 9 and 10) do not run with the VirtIO SCSI controller or with SCSI/SATA disks at all.
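Just as a note for later: if the IDE/SATA emulation turns out to be part of the problem, a disk can usually be reattached on a different bus with qm set - roughly like the sketch below, assuming VM 110 has its disk on ide0 and a (hypothetical) storage called "ceph_vm"; the guest needs working VirtIO drivers first:
Code:
qm set 110 --scsihw virtio-scsi-pci       # switch the SCSI controller type
qm set 110 --delete ide0                  # detach the disk (it shows up as unused0)
qm set 110 --scsi0 ceph_vm:vm-110-disk-0  # reattach the existing volume as scsi0
qm set 110 --boot c --bootdisk scsi0      # boot from the reattached disk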

Thanks and best regards,

roman
 
i use ceph without a RAID!
Ok, good!

Are the VMs that showed problems using all of those controller types, or only some of them?

And I changed the backup job so that the backup server backs up one node after the other - then it was mostly without issues.

But today the same error on two VMs :)

=> Your problems happen primarily while backups are being created, but sometimes also when they are not? Is there much other IO workload?


An insufficient storage connection (especially with SATA or IDE) could be the source of your problems. Do your Ceph logs show any storage-related problems like high latency or slow writes? You could try to limit the bandwidth for your backups, too.
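For example, a global limit can be set in /etc/vzdump.conf (the value is in KiB/s, so 51200 is roughly 50 MiB/s - pick a value that fits your storage):
Code:
# /etc/vzdump.conf
bwlimit: 51200
The same limit can be passed to a single run with vzdump ... --bwlimit 51200.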
 
