Proxmox v6.4-15 started to hang every night...

mstgeo

Hi!

We started to have a BIG problem with one of our Proxmox servers. It started to hang every night! A hardware reset is needed to bring it back up. Nagios is all red for the VMs, and there is no SSH access. Only the KVM console and the power breaker do their job; even a shutdown from the console does not help. FYI, we have LXC / KVM machines there -- 150+ LXC / ~10 KVM.
Server has:

CPU(s): 96 x Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz (2 Sockets)
Kernel Version: Linux 5.4.203-1-pve #1 SMP PVE 5.4.203-1 (Fri, 26 Aug 2022 14:43:35 +0200)
PVE Manager Version: pve-manager/6.4-15/af7986e6
RAM: 1.5TB
Storage: software RAID-5 of 6 x 1.92 TB NVMe -- DATA / software RAID-1 of 2 x 960 GB -- SYSTEM

After a physical reset all is well for a whole day. Then at night it hangs again! What the heck could it be? Can anyone help here? Thanks in advance!

I am attaching the syslog we got and our sysctl.conf; I hope they can be useful... Thanks!

regards,
Grzegorz Leskiewicz
 


Hi,
the excerpt from the syslog you posted only contains information about hung tasks for the commands uptime, pgrep and check_memory, which, judging from the backtraces, were blocked while interacting with a FUSE mount. Are there any other interesting messages earlier? You can check with e.g. cat /proc/mounts | grep fuse which FUSE mounts you have. I'd also run a health check on the disks and a memory test.
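For the disk health check, something along these lines should work (just a sketch; it assumes smartmontools is installed and the NVMe devices show up as /dev/nvme0, /dev/nvme1, ...):

# SMART health status and error log for each NVMe device (adjust the device names)
for dev in /dev/nvme?; do smartctl -a "$dev"; done

# quick in-place test of a small chunk of RAM (memtester; no substitute for a full offline memtest86+ run)
memtester 1G 1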

Please note that Proxmox VE 6 has been end-of-life for nearly one and a half years now: https://pve.proxmox.com/wiki/FAQ
To get (security) upgrades:
https://pve.proxmox.com/wiki/Upgrade_from_7_to_8
https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0
 
Hi!

Thanks for the fast reply!

cat /proc/mounts | grep fuse

fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
lxcfs /var/lib/lxcfs fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
/dev/fuse /etc/pve fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
root@111.111.111.111:/backup /mnt/backup fuse.sshfs rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0

As far as the disks are concerned, SMART reports PASSED on all of them and the wearout is 4%. As for memory, we have 1.5 TB on board and can't allow a long downtime on this server, so there is no way to test it now.

Would it be a good move to disable the remote mount below? Could it be the cause of the hang? This only happens at night, around 1/2 am CET. Maybe that is the point: there is no backup or other copying going on at that time...

root@111.111.111.111:/backup /mnt/backup fuse.sshfs rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0

regards,
Grzegorz Leskiewicz
 
Would it be a good move to disable the remote mount below? Could it be the cause of the hang? This only happens at night, around 1/2 am CET. Maybe that is the point: there is no backup or other copying going on at that time...

root@111.111.111.111:/backup /mnt/backup fuse.sshfs rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
It could, but it could also be the lxcfs or the cluster filesystem /etc/pve. You'd need to find out what the problematic commands accessed, e.g. are they running inside a container? If you don't require the mount, it's certainly worth a try.
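If it hangs again and you still have a shell, one way to see what a stuck command was doing (just a sketch; replace <pid> with the PID of a hung uptime/pgrep/check_memory process from the hung-task message):

# an lxc/<ctid> entry here means the process runs inside a container
cat /proc/<pid>/cgroup
# kernel stack and open files show which mount it is blocked on
cat /proc/<pid>/stack
ls -l /proc/<pid>/cwd /proc/<pid>/fd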
 
It could, but it could also be the lxcfs or the cluster filesystem /etc/pve. You'd need to find out what the problematic commands accessed, e.g. are they running inside a container? If you don't require the mount, it's certainly worth a try.

Any other thoughts? Why is it happening late at night, around 1/2 am? There is no traffic then... During the day everything is OK even under heavy traffic, and after a restart everything is OK for the whole day... What else can I debug?

Thanks!
 
Any other thoughts? Why is it happening late at night, around 1/2 am? There is no traffic then... During the day everything is OK even under heavy traffic, and after a restart everything is OK for the whole day... What else can I debug?
Is something happening on the server providing the sshfs at that time? I'd also check earlier parts of the syslog; there could be hints of what goes wrong before the actual errors appear.
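To narrow it down, one could check the scheduled jobs on the sshfs backup server and the syslog context just before the first hung-task report (a sketch; paths and the exact kernel message text may differ):

# on the backup server: anything scheduled around 1-2 am?
crontab -l; ls -l /etc/cron.d /etc/cron.daily

# on the hypervisor: what was logged right before the hung tasks
grep -B 100 'blocked for more than' /var/log/syslog | less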
 
Hi!

We got a hang yesterday at 1 am CET. Here is what we got...

Feb 8 00:08:01 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:03 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:04 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:04 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:06 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:08:07 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process




Feb 8 00:10:11 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:10:11 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:10:11 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process
Feb 8 00:10:14 proxmox-server lxcfs[2737]: utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process



and then a hang and a HW reboot!

Feb 8 00:14:01 proxmox-server systemd-modules-load[1967]: Inserted module 'iscsi_tcp'
Feb 8 00:14:01 proxmox-server kernel: [ 0.000000] Linux version 5.4.203-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.203-1 (Fri, 26 Aug 2022 14:43:35 +0200) ()
Feb 8 00:14:01 proxmox-server kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.203-1-pve root=UUID=0043175e-d430-4ce7-b28f-cce729993989 ro vga=normal nomodeset iommu=pt nosplash text biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 systemd.show_status=true



What the heck can it be?

Thanks for the help...


regards,
G.
 
Well, this seems to be related to lxcfs then. Anything special running in your containers at that time? You can take a closer look at the logs with journalctl -b -u lxcfs.service.
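To see whether any container has a job scheduled around that time, a loop like this could help (only a sketch; it assumes the containers are running and have sh, crontab and ls available):

# list cron jobs of every running container
for ct in $(pct list | awk 'NR>1 && $2=="running" {print $1}'); do
    echo "== CT $ct =="
    pct exec "$ct" -- sh -c 'crontab -l 2>/dev/null; ls /etc/cron.d 2>/dev/null'
done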

As already said, your installed version is pretty old. While you can't be sure an update will solve the issue, it very well could.
 
journalctl -b -u lxcfs.service
-- Logs begin at Thu 2024-02-08 00:13:54 GMT, end at Thu 2024-02-08 10:30:07 GMT. --
Feb 08 00:14:01 ludic-ovh-uk-01-srv systemd[1]: Started FUSE filesystem for LXC.
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: Running constructor lxcfs_init to reload liblxcfs
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: mount namespace: 4
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: hierarchies:
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 0: fd: 5:
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 1: fd: 6: name=systemd
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 2: fd: 7: freezer
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 3: fd: 8: cpu,cpuacct
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 4: fd: 9: hugetlb
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 5: fd: 10: memory
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 6: fd: 11: cpuset
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 7: fd: 12: perf_event
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 8: fd: 13: devices
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 9: fd: 14: net_cls,net_prio
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 10: fd: 15: blkio
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 11: fd: 16: rdma
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: 12: fd: 17: pids
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: Kernel supports pidfds
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: Kernel supports swap accounting
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: api_extensions:
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - cgroups
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - sys_cpu_online
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_cpuinfo
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_diskstats
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_loadavg
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_meminfo
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_stat
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_swaps
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - proc_uptime
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - shared_pidns
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - cpuview_daemon
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - loadavg_daemon
Feb 08 00:14:01 ludic-ovh-uk-01-srv lxcfs[2715]: - pidfds
 
Oh right, you have to use -b-1 to get the logs from the previous boot.
 
hmm... strange...

journalctl -b-1 -u lxcfs.service
Specifying boot ID or boot offset has no effect, no persistent journal was found.
 
BTW, this is the cause of the problem:

utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process

What is this exactly?
 
hmm... strange...

journalctl -b-1 -u lxcfs.service
Specifying boot ID or boot offset has no effect, no persistent journal was found.
Seems like you don't have a persistent systemd journal configured. See man 5 journald.conf or use your favorite search engine to find out more.
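If it helps, the usual way to enable a persistent journal is roughly this (a sketch; see man 5 journald.conf for the details):

# create the on-disk journal directory and tell journald to keep logs there
mkdir -p /var/log/journal
sed -i 's/^#\?Storage=.*/Storage=persistent/' /etc/systemd/journald.conf
systemctl restart systemd-journald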
BTW, this is the cause of the problem:
It might also be just another symptom.
utils.c: 254: recv_creds: Timed out waiting for scm_cred: No such process

What is this exactly?
Haven't dug into LXC internals much, but please upgrade first and see if the issue is still there with current versions.
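Before attempting that, the read-only checker script shipped with recent 6.4 packages can be run to list known blockers for the 6-to-7 upgrade (it only checks, it does not change anything):

pve6to7 --full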
 
Is it safe to upgrade a BIG production host with 100+ VMs to v7 without any issues? We can't allow any downtime there... We'd like to fix this and then migrate to NEW hardware.
 
