Proxmox v6.4-15 started to hang every night...

mstgeo

Hi!

We started to have a BIG problem with one of our Proxmox servers. It started to hang every night! A hardware reset is needed to bring it back up. Nagios is all red for the VMs, and there is no SSH access. Only the KVM console and the power breaker do their job; even a shutdown from the console does not help. FYI, we have LXC / KVM machines there -- 150+ LXC / ~10 KVM.
Server has:

CPU(s): 96 x Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz (2 Sockets)
Kernel Version: Linux 5.4.203-1-pve #1 SMP PVE 5.4.203-1 (Fri, 26 Aug 2022 14:43:35 +0200)
PVE Manager Version: pve-manager/6.4-15/af7986e6
RAM: 1.5TB
Storage: software RAID-5 of 6 x 1.92 TB NVMe -- DATA / software RAID-1 of 2 x 960 GB -- SYSTEM

After a physical reset all is well for a whole day. Then at night it hangs again! What the heck could it be? Can anyone help here? Thanks in advance!

I am attaching the syslog we got and our sysctl.conf; I hope they can be useful... Thanks!

regards,
Grzegorz Leskiewicz
 


Hi,
the excerpt from the syslog you posted only contains information about hung tasks for the commands uptime, pgrep and check_memory, which, judging from the backtraces, were blocked while interacting with a FUSE mount. Are there any other interesting messages earlier? You can check with e.g. cat /proc/mounts | grep fuse which FUSE mounts you have. I'd also run a health check on the disks and a memory test.
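For the disk health check, something along these lines should work (just a sketch; it assumes smartmontools is installed and the NVMe devices show up as /dev/nvme0, /dev/nvme1, ...):

# SMART health status and error log for each NVMe device (adjust the device names)
for dev in /dev/nvme?; do smartctl -a "$dev"; done

# quick in-place test of a small chunk of RAM (memtester; no substitute for a full offline memtest86+ run)
memtester 1G 1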

Please note that Proxmox VE 6 has been end-of-life for nearly one and a half years now: https://pve.proxmox.com/wiki/FAQ
To get (security) upgrades:
https://pve.proxmox.com/wiki/Upgrade_from_7_to_8
https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0
 
Hi!

Thanks for the fast reply!

cat /proc/mounts | grep fuse

fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
lxcfs /var/lib/lxcfs fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
/dev/fuse /etc/pve fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
root@111.111.111.111:/backup /mnt/backup fuse.sshfs rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0

As far as the disks are concerned, SMART reports PASSED on all of them and the wearout is 4%. As for memory, we have 1.5 TB on board and can't allow a long downtime on this server, so there is no way to test it now.

Would it be a good move to disable the remote mount below? Could it be the cause of the hang? This only happens at night, around 1/2 am CET. Maybe that is the point: there is no backup or other copying going on at that time...

root@111.111.111.111:/backup /mnt/backup fuse.sshfs rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0

regards,
Grzegorz Leskiewicz
 
Would it be a good move to disable the remote mount below? Could it be the cause of the hang? This only happens at night, around 1/2 am CET. Maybe that is the point: there is no backup or other copying going on at that time...

root@111.111.111.111:/backup /mnt/backup fuse.sshfs rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
It could, but it could also be the lxcfs or the cluster filesystem /etc/pve. You'd need to find out what the problematic commands accessed, e.g. are they running inside a container? If you don't require the mount, it's certainly worth a try.
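If it hangs again and you still have a shell, one way to see what a stuck command was doing (just a sketch; replace <pid> with the PID of a hung uptime/pgrep/check_memory process from the hung-task message):

# an lxc/<ctid> entry here means the process runs inside a container
cat /proc/<pid>/cgroup
# kernel stack and open files show which mount it is blocked on
cat /proc/<pid>/stack
ls -l /proc/<pid>/cwd /proc/<pid>/fd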
 
It could, but it could also be the lxcfs or the cluster filesystem /etc/pve. You'd need to find out what the problematic commands accessed, e.g. are they running inside a container? If you don't require the mount, it's certainly worth a try.

Any other thoughts? Why is it happening late at night, around 1/2 am? There is no traffic then... During the day everything is OK even under heavy traffic, and after a restart everything is OK for the whole day... What else can I debug?

Thanks!
 
Any other thoughts? Why is it happening late at night, around 1/2 am? There is no traffic then... During the day everything is OK even under heavy traffic, and after a restart everything is OK for the whole day... What else can I debug?
Is something happening on the server providing the sshfs at that time? I'd also check earlier parts of the syslog; there could be hints of what goes wrong before the actual errors appear.
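To narrow it down, one could check the scheduled jobs on the sshfs backup server and the syslog context just before the first hung-task report (a sketch; paths and the exact kernel message text may differ):

# on the backup server: anything scheduled around 1-2 am?
crontab -l; ls -l /etc/cron.d /etc/cron.daily

# on the hypervisor: what was logged right before the hung tasks
grep -B 100 'blocked for more than' /var/log/syslog | less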
 
Hi!

We got a hang yesterday at 1 am CET. Here is what we got...

Feb 8 00:08:01 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:04 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:04 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:06 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:07 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process




Feb 8 00:10:11 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:10:11 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:10:11 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:10:14 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process



and then a hang and a HW reboot!

Feb 8 00:14:01 proxmox-server systemd-modules-load[1967]: Inserted module 'iscsi_tcp'
Feb 8 00:14:01 proxmox-server kernel: [ 0.000000] Linux version 5.4.203-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.203-1 (Fri, 26 Aug 2022 14:43:35 +0200) ()
Feb 8 00:14:01 proxmox-server kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.203-1-pve root=UUID=0043175e-d430-4ce7-b28f-cce729993989 ro vga=normal nomodeset iommu=pt nosplash text biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 systemd.show_status=true



What the heck can it be?

Thanks for the help...


regards,
G.
 
Well, this seems to be related to lxcfs then. Anything special running in your containers at that time? You can take a closer look at the logs with journalctl -b -u lxcfs.service.
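To see whether any container has a job scheduled around that time, a loop like this could help (only a sketch; it assumes the containers are running and have sh, crontab and ls available):

# list cron jobs of every running container
for ct in $(pct list | awk 'NR>1 && $2=="running" {print $1}'); do
    echo "== CT $ct =="
    pct exec "$ct" -- sh -c 'crontab -l 2>/dev/null; ls /etc/cron.d 2>/dev/null'
done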

As already said, your installed version is pretty old. While you can't be sure an update will solve the issue, it very well could.
 
journalctl -b -u lxcfs.service
-- Logs begin at Thu 2024-02-08 00:13:54 GMT, end at Thu 2024-02-08 10:30:07 GMT. --
Feb 08 00:14:01 ludic-ovh-uk-01-srv systemd[1]: Started FUSE filesystem for LXC.
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: Running constructor lxcfs_init to reload liblxcfs
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: mount namespace: 4
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: hierarchies:
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 0: fd: 5:
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 1: fd: 6: name=systemd
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 2: fd: 7: freezer
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 3: fd: 8: cpu,cpuacct
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 4: fd: 9: hugetlb
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 5: fd: 10: memory
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 6: fd: 11: cpuset
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 7: fd: 12: perf_event
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 8: fd: 13: devices
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 9: fd: 14: net_cls,net_prio
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 10: fd: 15: blkio
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 11: fd: 16: rdma
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 12: fd: 17: pids
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: Kernel supports pidfds
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: Kernel supports swap accounting
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: api_extensions:
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - cgroups
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - sys_cpu_online
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_cpuinfo
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_diskstats
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_loadavg
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_meminfo
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_stat
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_swaps
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_uptime
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - shared_pidns
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - cpuview_daemon
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - loadavg_daemon
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - pidfds
 
Oh right, you have to use -b-1 to get the logs from the previous boot.
 
hmm... strange...

journalctl -b-1 -u lxcfs.service
Specifying boot ID or boot offset has no effect, no persistent journal was found.
 
BTW, this is the cause of the problem:

utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process

What is this exactly?
 
hmm... strange...

journalctl -b-1 -u lxcfs.service
Specifying boot ID or boot offset has no effect, no persistent journal was found.
Seems like you don't have a persistent systemd journal configured. See man 5 journald.conf or use your favorite search engine to find out more.
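If it helps, the usual way to enable a persistent journal is roughly this (a sketch; see man 5 journald.conf for the details):

# create the on-disk journal directory and tell journald to keep logs there
mkdir -p /var/log/journal
sed -i 's/^#\?Storage=.*/Storage=persistent/' /etc/systemd/journald.conf
systemctl restart systemd-journald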
BTW, this is the cause of the problem:
It might also be just another symptom.
utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process

What is this exactly?
Haven't dug into LXC internals much, but please upgrade first and see if the issue is still there with current versions.
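Before attempting that, the read-only checker script shipped with recent 6.4 packages can be run to list known blockers for the 6-to-7 upgrade (it only checks, it does not change anything):

pve6to7 --full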
 
Is it safe to upgrade a BIG production host with 100+ VMs to v7 without any issues? We can't allow any downtime there... We'd like to fix this and then migrate to NEW hardware.
 
