Server crash

giuppy

Hi,
I have a server within a cluster that from time to time crashes and becomes totally unresponsive.
I am attaching the last screen recorded before the crash.
Within the cluster it is the only one running LXC containers (a very small number of them).
I am sure I am not out of RAM, because only about 100 GB are assigned on a system with 1 TB of RAM.
Any suggestions?
Thanks!

Code:
pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-12
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-1
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 

Attachment: schermata.jpg (screenshot recorded before the crash)
hi,
Within the cluster it is the only one running LXC containers (a very small number of them).
I am sure I am not out of RAM, because only about 100 GB are assigned on a system with 1 TB of RAM.
Any suggestions?
* how much memory is assigned to each container?
* how many containers are running at the same time?
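Both can be checked quickly on the host, for example (a small sketch, assuming the standard /etc/pve/lxc config location):

Code:
pct list                               # running containers and their status
grep -H '^memory' /etc/pve/lxc/*.conf  # memory (in MB) assigned to each container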
 
Hi,
There are 4 containers running:
1 × 4 GB
2 × 12 GB
1 × 10 GB

The host memory usage is 11.79% (118.81 GiB of 1007.76 GiB)
 
* what about tmpfs mounts inside your containers? check with mount | grep tmpfs inside the container

* can you post the container configurations as well? pct config CTID

* what does swap usage inside the containers look like? (a combined check is sketched below)
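These checks can also be run together from the host (a sketch; the CT IDs below are placeholders for your actual containers):

Code:
for ct in 126 127 128 129; do
    echo "=== CT $ct ==="
    pct config $ct                              # container configuration
    pct exec $ct -- sh -c 'mount | grep tmpfs'  # tmpfs mounts inside the container
    pct exec $ct -- free -m                     # memory and swap usage inside the container
done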
 
Here it is:
Code:
root@lxc2~# mount | grep tmpfs
none on /dev type tmpfs (rw,relatime,size=492k,mode=755,uid=100000,gid=100000,inode64)
udev on /dev/full type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
udev on /dev/null type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
udev on /dev/random type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
udev on /dev/tty type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
udev on /dev/urandom type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
udev on /dev/zero type devtmpfs (rw,nosuid,relatime,size=528326268k,nr_inodes=132081567,mode=755,inode64)
none on /proc/sys/kernel/random/boot_id type tmpfs (ro,nosuid,nodev,noexec,relatime,size=492k,mode=755,uid=100000,gid=100000,inode64)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,uid=100000,gid=100000,inode64)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755,uid=100000,gid=100000,inode64)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,uid=100000,gid=100000,inode64)

Code:
pct config 126
arch: amd64
cores: 12
hostname: ***.com
memory: 12288
net0: name=eth0,bridge=vmbr0,firewall=1,gw=10.2.70.1,hwaddr=E6:91:90:3F:EB:0A,ip=10.2.70.84/24,type=veth
ostype: debian
rootfs: NVME:subvol-126-disk-0,size=28G
swap: 0
unprivileged: 1


Swap is disabled at the host level.
 
thanks, and how do df -h and free -m look in the containers?
 
here:

Code:
df -h
Filesystem              Size  Used Avail Use% Mounted on
NVME/subvol-126-disk-0   28G   18G   11G  63% /
none                    492K  4.0K  488K   1% /dev
udev                    504G     0  504G   0% /dev/tty
tmpfs                   504G     0  504G   0% /dev/shm
tmpfs                   504G  145M  504G   1% /run
tmpfs                   5.0M     0  5.0M   0% /run/lock

Code:
free -m
              total        used        free      shared  buff/cache   available
Mem:          12288        3743        8227         149         316        8544
Swap:             0           0           0
 
looks normal to me, could you please attach the journal from your PVE host?
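For example, something like this can be used to export it (a sketch, assuming persistent journaling is enabled and the crash happened during the previous boot; the timestamps are placeholders):

Code:
# journal of the previous boot (the one that ended with the crash)
journalctl -b -1 > journal-previous-boot.txt

# or an explicit time window around the crash (adjust the timestamps)
journalctl --since "2022-07-05 06:00" --until "2022-07-05 07:00" > journal-crash.txt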
 
I had a 7.2 server that crashed randomly, sometimes several times a day. I enabled kernel dump and finally caught the error: one of the memory modules was bad. Replaced the module with a new one and the server has been up for 41 days now without a crash. Not sure if it's your case but it might be worth looking at.
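For reference, setting up a crash dump on a Debian-based PVE host usually goes roughly like this (a sketch using kdump-tools; the reserved memory size is only an example, and boot-loader handling differs between GRUB and systemd-boot setups):

Code:
apt install kdump-tools              # crash-dump tooling
# add crashkernel=256M to the kernel command line (e.g. in /etc/default/grub),
# then run update-grub (or proxmox-boot-tool refresh) and reboot
kdump-config show                    # verify that the crash kernel is loaded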
 
I had a 7.2 server that crashed randomly, sometimes several times a day. I enabled kernel dump and finally caught the error: one of the memory modules was bad. Replaced the module with a new one and the server has been up for 41 days now without a crash. Not sure if it's your case but it might be worth looking at.
No, this is not the case. If memory were the issue, I would see a message in the Dell iDRAC.
 
looks normal to me, could you please attach the journal from your PVE host?
Plenty of these (the entries below are the last ones just before the reboot):
Code:
Jul 05 06:49:34 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:35 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:35 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:35 pve2 pvestatd[7387]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Jul 05 06:49:36 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:36 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:37 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:37 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:38 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:38 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:39 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:39 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:39 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:39 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:40 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:40 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:40 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:40 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:41 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:41 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:41 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:41 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:42 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:42 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:43 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:43 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:44 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:44 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:45 pve2 pvestatd[7387]: r720int: error fetching datastores - 500 Can't connect to 192.168.33.100:8007 (No route to host)
Jul 05 06:49:45 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:45 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:46 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:46 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:47 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:47 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:48 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:48 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:48 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:48 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:49 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:49 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:49 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:49 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
Jul 05 06:49:50 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.100, on dev eth0
Jul 05 06:49:50 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff da 7c c0 25 76 0c 08 06
Jul 05 06:49:50 pve2 kernel: IPv4: martian source 172.16.12.90 from 172.16.12.60, on dev eth0
Jul 05 06:49:50 pve2 kernel: ll header: 00000000: ff ff ff ff ff ff d6 05 e1 9b fd 08 08 06
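As a side note, those "martian source" lines are only logged because net.ipv4.conf.*.log_martians is enabled; they indicate packets arriving on eth0 whose source address fails the reverse-path check (often asymmetric routing or an address reachable via another interface), so they are probably noise rather than the cause of the crash. The relevant settings can be inspected like this (a sketch):

Code:
sysctl net.ipv4.conf.all.log_martians net.ipv4.conf.eth0.log_martians
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter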
 
hi!
It happened once again.
Attachment: Cattura0000.PNG
The link was not down at all; on the router side it was operational at 10 Gb.
Here is what it looked like when I rebooted the machine:
Attachment: Cattura0001.PNG
But the link kept showing as down until I had rebooted 3 times.
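Next time the link state could also be cross-checked from the host side (a sketch; eno1 is only an example interface name, use the actual 10G uplink):

Code:
ip -br link show                            # one-line status of every interface
ethtool eno1 | grep -E 'Speed|Link'         # negotiated speed and "Link detected"
dmesg -T | grep -iE 'link (is )?(up|down)'  # NIC driver link transitions with timestamps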
 
