Question mark after CT launch

iruindegi

Previous setup:
- 1 Supermicro server
* 32 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (2 sockets)
* 128 GB RAM
* 3x SSD RAID for the Proxmox system
* 3x 4TB WD Red
* Shared disk (NFS) for VMs and CTs

- 1 HP Microserver (FreeNAS)
* Intel(R) Xeon(R) CPU E3-1220L V2 @ 2.30GHz
* 8 GB RAM
* 4x 4TB WD Red hard drives


We installed some CTs and VMs with this setup on the shared (NFS - FreeNAS) disk. Everything was OK.

Afterwards, we bought 2 more Supermicro servers (same as the previous one) and created a cluster and Ceph.
The cluster is formed by 4 machines (3 Supermicro + 1 HP Microserver).
Ceph is built with all the WD Red disks available on the Supermicro servers.
The HP server (NFS) is used for backup.

We made a backup of each VM and CT to FreeNAS, and when we finished with the cluster and Ceph configuration, we restored every backup into the new setup (VM/CT disks on Ceph).

The PROBLEM is with the containers. If I shut down a container and try to start it again, it hangs. After some minutes a question mark icon appears on every CT. If a CT was already running I can access it and work in it, but I can't do anything else. If I connect to the server and run `pct list`, it doesn't return anything.
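
(Note: the grey question mark usually just means that pvestatd, the node status daemon, has stopped reporting, typically because it is blocked on a hung storage. A quick generic check, nothing specific to this setup, would be something like:)
Code:
# check whether the node status daemon is still alive and responding
systemctl status pvestatd

# check all configured storages; a hung NFS/RBD storage will make this command block
pvesm status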

[Screenshot: proxmox.jpg]


The only way to restore the system is restarting the server with the RESET button.

I just did this now:

- I disabled the "start at boot" option
- When everything was OK, I tried to start an imported CT and got this error:
Code:
Job for pve-container@500.service failed because a timeout was exceeded.
See "systemctl status pve-container@500.service" and "journalctl -xe" for details.
TASK ERROR: command 'systemctl start pve-container@500' failed: exit code 1

After some minutes, the question mark appears and the system hangs.
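
Since the systemd unit only reports a generic timeout, one way to get more detail (a sketch, assuming the CT ID 500 from the error above) is to start the container in the foreground with LXC debug logging:
Code:
# run the container in the foreground with verbose LXC logging
lxc-start -n 500 -F -l DEBUG -o /tmp/lxc-500.log

# then inspect the log for the step where the start gets stuck
less /tmp/lxc-500.log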

Does it mean that we cannot import CTs from the previous setup (there is no problem with the VMs) and that we have to create the CTs again?

If I set 'start at boot' to yes, the CT starts fine. But if I shut it down and try to start it again, it hangs.

It hung in the previous setup too; we thought it might be due to NFS, so we switched to Ceph.

Any help or clue with this?
 
How is your system set up? The hardware above has surely changed with the addition of Ceph.
 
Current setup:
- 3 Supermicro servers
* 32 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (2 sockets)
* 128 GB RAM
* 3x SSD RAID for the Proxmox system
* 3x 4TB WD Red
* Shared disk (NFS) for VMs and CTs

- 1 HP Microserver (FreeNAS)
* Intel(R) Xeon(R) CPU E3-1220L V2 @ 2.30GHz
* 8 GB RAM
* 4x 4TB WD Red hard drives

We have a cluster with these 4 servers, but Ceph is installed only on the 3 Supermicro servers.
 
Is ceph on three separate servers?
On which disks is ceph installed?
How is ceph configured (size/min_size, replication, ...)?
Where is your containers storage located?
How is this storage configured (/etc/pve/storage.cfg)?
How are these servers connected?
What is shown in the syslog/journal?
 
Sorry, I was on holidays...

Is ceph on three separate servers?
Yes it is, on the 3 Supermicro servers.

On which disks is ceph installed?
Each server has a 3x SSD RAID where Proxmox and Ceph are installed.

How is ceph configured (size/min_size, replication, ...)?
Code:
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = X.X.X.X/24
     fsid = XXX
     keyring = /etc/pve/priv/$cluster.$name.keyring
     mon allow pool delete = true
     osd journal size = 5120
     osd pool default min size = 2
     osd pool default size = 3
     public network = X.X.X.X/24
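
For completeness, the effective pool settings and the overall cluster health can be checked at runtime with the standard Ceph CLI (nothing here is specific to this config):
Code:
# overall cluster state (health, OSDs up/in, PG status)
ceph -s

# per-pool size / min_size actually in effect
ceph osd pool ls detail

# capacity and usage per pool
ceph df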

Where is your containers storage located?
Now, it is on Ceph PoolXXXX_ct

How is this storage configured (/etc/pve/storage.cfg)?
Code:
root@pve1:~# cat /etc/pve/storage.cfg
dir: local
    path /var/lib/vz
    content iso,backup,vztmpl

lvmthin: local-lvm
    thinpool data
    vgname pve
    content images,rootdir

rbd: PoolZerbi_vm
    content images
    krbd 0
    pool PoolZerbi

rbd: PoolZerbi_ct
    content rootdir
    krbd 1
    pool PoolZerbi

nfs: kopiakNAS
    export /mnt/NFSNAS/Birt
    path /mnt/pve/kopiakNAS
    server X.X.X.X
    content iso,backup,vztmpl
    maxfiles 1
    options vers=3
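
A quick way to confirm that the RBD storages above are reachable from the PVE side (pool name PoolZerbi taken from the config above, adjust as needed):
Code:
# status of all storages defined in /etc/pve/storage.cfg
pvesm status

# list the container volumes on the CT storage
pvesm list PoolZerbi_ct

# list the RBD images directly in the pool
rbd -p PoolZerbi ls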

How are these server connected?
Currently with a 1Gb LAN (with bonding, so 2Gb).
The cluster network is in its own VLAN.
The Ceph network is in another, separate VLAN.
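
To verify what the bond actually negotiated, something like this should work (the interface names bond0 and eno1 are just placeholders here):
Code:
# bonding mode, slave state and link speed of each slave
cat /proc/net/bonding/bond0

# negotiated speed/duplex of a single slave NIC
ethtool eno1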


What is shown in the syslog/journal?
I just started a CT which was imported from the previous setup (no cluster, no Ceph, storage over NFS...) and after some minutes the question mark icon appears. Logs:
Syslog:

Code:
Mar 28 09:07:12 pve2 systemd[1]: pve-container@835.service: Start operation timed out. Terminating.
Mar 28 09:07:12 pve2 systemd[1]: Failed to start PVE LXC Container: 835.
Mar 28 09:07:12 pve2 systemd[1]: pve-container@835.service: Unit entered failed state.
Mar 28 09:07:12 pve2 systemd[1]: pve-container@835.service: Failed with result 'timeout'.
Mar 28 09:07:12 pve2 pvedaemon[2605245]: command 'systemctl start pve-container@835' failed: exit code 1
Mar 28 09:07:12 pve2 pvedaemon[2664]: <root@pam> end task UPID:pve2:0027C0BD:027BE5C2:5ABB3EC6:vzstart:835:root@pam: command 'systemctl start pve-container@835' failed: exit code 1

Journal:
Code:
Mar 28 09:07:12 pve2 systemd[1]: pve-container@835.service: Start operation timed out. Terminating.
Mar 28 09:07:12 pve2 systemd[1]: Failed to start PVE LXC Container: 835.
Mar 28 09:07:12 pve2 systemd[1]: pve-container@835.service: Unit entered failed state.
Mar 28 09:07:12 pve2 systemd[1]: pve-container@835.service: Failed with result 'timeout'.
Mar 28 09:07:12 pve2 pvedaemon[2605245]: command 'systemctl start pve-container@835' failed: exit code 1
Mar 28 09:07:12 pve2 pvedaemon[2664]: <root@pam> end task UPID:pve2:0027C0BD:027BE5C2:5ABB3EC6:vzstart:835:root@pam: command 'systemctl start pve-container@835' failed: exit code 1


Now all VMs and CTs that were already running still work fine, and I can connect to the server via SSH, but
`pct list` doesn't show anything... I have to hit Ctrl+C to kill the process.
`qm list` works fine.
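
When `pct list` hangs like this while `qm list` works, the process is usually stuck in uninterruptible I/O on the container storage; generic Linux tooling (not PVE-specific) might confirm it:
Code:
# processes stuck in uninterruptible sleep (D state)
ps axo pid,stat,wchan:40,cmd | awk '$2 ~ /D/'

# kernel messages about tasks blocked on I/O
dmesg | grep -i "blocked for more than"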
 
On which disks is ceph installed?
Each server has a 3x SSD RAID where Proxmox and Ceph are installed.
I suppose it is a RAID-5. This will break your performance. Ceph (like ZFS) wants control over its disks (one disk, one daemon). The Ceph log should tell you more about it; I guess it might already show 'slow requests'.
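
To see whether Ceph is already reporting slow requests, and how the OSDs are actually laid out per node, something along these lines should do (the log path is the default cluster log location on a monitor node):
Code:
# current warnings, including slow/blocked requests
ceph health detail

# OSD layout per host; one OSD per physical disk is the expected setup
ceph osd tree

# past slow request entries in the cluster log (on a monitor node)
grep -i "slow request" /var/log/ceph/ceph.log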

How are these servers connected?
Currently with a 1Gb LAN (with bonding, so 2Gb).
The cluster network is in its own VLAN.
The Ceph network is in another, separate VLAN.
The cluster, storage and client networks need to be separated and upsized if possible.

Corosync needs a low (< 4 ms) and stable latency, otherwise the cluster will not have quorum. In the worst case, all your nodes reboot simultaneously (if HA is activated). What you might already see is that /etc/pve is read-only (pvecm status -> no quorum).
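
Quorum and ring health can be checked directly on a node, and omping (as suggested in the PVE docs, if it is installed) gives a feel for the multicast latency between the nodes (the hostnames below are placeholders):
Code:
# quorum state and expected/total votes
pvecm status

# status of the corosync ring(s)
corosync-cfgtool -s

# multicast latency test between the cluster nodes (run on every node)
omping -c 600 -i 1 -q pve1 pve2 pve3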

Ceph likes low latency too and will use all the available bandwidth, either on recovery (disk failure) or through heavy client I/O. 2Gb is just not enough if you are using SSDs.

Client traffic will greatly interfere with the above two. Not to mention backup/restore, these will also add heavy I/O.

The above may already be enough to slow down the storage drastically and be the cause of the timeouts when starting a VM/CT. You should see more information in the syslog/journal (and the Ceph logs).

EDIT: fixed some typos. ;)
 
