Server crash

giuppy

New Member
Dec 10, 2020
Hi,
I have a server in a cluster that crashes from time to time and becomes totally unresponsive.
I'm attaching the last screen recorded before the crash.
Within the cluster it is the only node running LXC containers (a very small number of them).
I am sure I am not out of RAM: only about 100 GB are assigned on a system with 1 TB of RAM.
Any suggestion?
Thanks!

Code:
pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-12
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 

Attachments

  • schermata.jpg (243.8 KB)

oguz

Proxmox Staff Member
Nov 19, 2018
hi,
Within the cluster it is the only node running LXC containers (a very small number of them).
I am sure I am not out of RAM: only about 100 GB are assigned on a system with 1 TB of RAM.
Any suggestion?
* how much memory is assigned to each container?
* how many containers are running at the same time? (a quick way to collect both is sketched below)
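If it helps, a minimal sketch for gathering both answers on the host, assuming the standard Proxmox config location /etc/pve/lxc/:

Code:
# list all containers with their current status
pct list
# show the memory assignment (in MB) of every container config
grep -H '^memory' /etc/pve/lxc/*.conf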
 

giuppy

New Member
Dec 10, 2020
Hi,
There are 4 containers running:
1 × 4 GB
2 × 12 GB
1 × 10 GB

The host memory usage is 11.79% (118.81 GiB of 1007.76 GiB).
 

oguz

Proxmox Staff Member
Nov 19, 2018
* what about tmpfs mounts inside your containers? check with mount | grep tmpfs inside the container

* can you post the container configurations as well? pct config CTID

* what does swap usage inside the containers look like? (a sketch for checking all containers at once follows below)
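A hedged one-liner to collect all of that from every running container in one pass, assuming pct exec works in your setup:

Code:
# loop over running containers and print memory, swap and tmpfs state
for ct in $(pct list | awk '$2 == "running" {print $1}'); do
    echo "=== CT $ct ==="
    pct exec "$ct" -- free -m
    pct exec "$ct" -- sh -c 'mount | grep tmpfs'
done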
 

giuppy

New Member
Dec 10, 2020
Here it is:
Code:
root@lxc2~# mount | grep tmpfs
none on /dev type tmpfs (rw,relatime,size=492k,mode=755,uid=100000,gid=100000,inode64)
udev on /dev/full type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
udev on /dev/null type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
udev on /dev/random type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
udev on /dev/tty type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
udev on /dev/urandom type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
udev on /dev/zero type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
none on /proc/sys/kernel/random/boot_id type tmpfs (ro,nosuid,nodev,noexec,relatime,size=492k,mode=755,uid=100000,gid=100000,inode64)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,uid=100000,gid=100000,inode64)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755,uid=100000,gid=100000,inode64)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,uid=100000,gid=100000,inode64)

Code:
pct config 126
arch: amd64
cores: 12
hostname: ***.com
memory: 12288
net0: name=eth0,bridge=vmbr0,firewall=1,gw=10.2.70.1,hwaddr=E6:91:90:3F:EB:0A,ip=10.2.70.84/24,type=veth
ostype: debian
rootfs: NVME:subvol-126-disk-0,size=28G
swap: 0
unprivileged: 1


Swap is disabled at the host level.
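For completeness, that can be confirmed on the host like this:

Code:
# prints nothing when no swap device is active
swapon --show
# the Swap row should read all zeros
free -h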
 

oguz

Proxmox Staff Member
Nov 19, 2018
thanks, and how do df -h and free -m look in the containers?
 

giuppy

New Member
Dec 10, 2020
here:

Code:
df -h
Filesystem              Size  Used Avail Use% Mounted on
NVME/subvol-126-disk-0   28G   18G   11G  63% /
none                    492K  4.0K  488K   1% /dev
udev                    504G     0  504G   0% /dev/tty
tmpfs                   504G     0  504G   0% /dev/shm
tmpfs                   504G  145M  504G   1% /run
tmpfs                   5.0M     0  5.0M   0% /run/lock

Code:
free -m
              total        used        free      shared  buff/cache   available
Mem:          12288        3743        8227         149         316        8544
Swap:             0           0           0
 

oguz

Proxmox Staff Member
Nov 19, 2018
looks normal to me, could you please attach the journal from your PVE host?
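For anyone following along, one way to pull that journal is sketched below; -b -1 selects the previous boot (the one that ended in the crash) and requires a persistent journal:

Code:
# journal of the previous boot, written to a file you can attach
journalctl -b -1 > journal-before-crash.txt
# if the journal does not persist across reboots, enable that first
mkdir -p /var/log/journal
systemctl restart systemd-journald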
 

jameswang

New Member
Mar 10, 2022
I had a 7.2 server that crashed randomly, sometimes several times a day. I enabled kernel dump and finally caught the error: one of the memory modules was bad. I replaced the module with a new one, and the server has been up for 41 days now without a crash. Not sure if it's your case, but it might be worth looking at.
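For reference, enabling this on a Debian-based host such as Proxmox looks roughly like the sketch below (the standard Debian kdump-tools package, not an official Proxmox procedure):

Code:
# install the kdump tooling and verify a crash kernel is configured
apt install kdump-tools
kdump-config show
# memory for the crash kernel may need to be reserved via the
# crashkernel= boot parameter before dumps can be captured;
# captured dumps end up under /var/crash by default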
 

giuppy

New Member
Dec 10, 2020
I had a 7.2 server that crashed randomly, sometimes several times a day. I enabled kernel dump and finally caught the error: one of the memory modules was bad. I replaced the module with a new one, and the server has been up for 41 days now without a crash. Not sure if it's your case, but it might be worth looking at.
No, this is not the case. If memory were the issue, I would see a message in the Dell iDRAC.
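If Dell's racadm utility is installed, the iDRAC System Event Log can also be dumped from the OS as a cross-check (a hedged example; ECC memory errors would be listed there):

Code:
# dump the iDRAC System Event Log
racadm getsel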
 

giuppy

New Member
Dec 10, 2020
looks normal to me, could you please attach the journal from your PVE host?
Plenty of these (the entries below are the last ones just before the reboot):
Code:
Jul 05 06:49:34 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:35 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:35 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:35 pve2 pvestatd[7387]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Jul 05 06:49:36 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:36 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:37 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:37 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:38 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:38 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:39 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:39 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:39 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:39 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:40 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:40 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:40 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:40 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:41 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:41 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:41 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:41 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:42 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:42 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:43 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:43 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:44 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:44 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:45 pve2 pvestatd[7387]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Jul 05 06:49:45 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:45 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:46 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:46 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:47 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:47 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:48 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:48 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:48 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:48 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:49 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:49 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:49 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:49 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:50 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:50 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:50 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:50 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
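Side note: a "martian source" is a packet whose source address the kernel considers impossible on the interface it arrived on; here it is broadcast ARP traffic for 172.16.12.90 (the trailing 08 06 in the ll header is the ARP EtherType), which usually points to a duplicate IP or a stray VLAN on that segment rather than a Proxmox problem. Whether these get logged is controlled by sysctls, which can be inspected as sketched below (the interface name is an assumption; silencing the log does not fix the underlying network issue):

Code:
# 1 means martian packets are logged for that scope
sysctl net.ipv4.conf.all.log_martians
sysctl net.ipv4.conf.eth0.log_martians
# reverse-path filtering decides what counts as martian
sysctl net.ipv4.conf.all.rp_filter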
 

giuppy

New Member
Dec 10, 2020
Hi!
It happened once again.
Cattura0000.PNG
The link was not actually down: on the router it showed as operational at 10 Gb.
Here is when I rebooted the machine:
Cattura0001.PNG
But the link kept showing as down until I had rebooted 3 times.
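A hedged suggestion for the next occurrence: capture the NIC state from the host console before rebooting, to tell a driver or firmware hang apart from a real link failure (the interface name is an assumption, adjust as needed):

Code:
# driver-reported link state, speed and negotiation
ethtool eth0
# per-interface packet and error counters
ip -s link show eth0
# recent kernel messages mentioning the NIC or link state
dmesg | grep -i -e eth0 -e 'link is'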
 
