I have a 5-node system where only nodes 1-3 carry any workload, a mixture of imported KVM VMs and new containers.
This morning something wasn't right with one of the containers, which is responsible for a couple of intranet websites. As we couldn't SSH into the container, we checked from one of the other nodes whether it was still running.
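(For reference, this is roughly how we checked from another node; pct status only works on the CT's home node, so going through the cluster API is the alternative:)
Code:
# On the hosting node itself:
pct status 123
# From any other node in the cluster, via the API:
pvesh get /nodes/proxmoxy3/lxc/123/status/current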
Oddly enough, the container that had died was 123, but the entire node was messed up. I was not able to get onto the node, via SSH or otherwise, but using the GUI I was able to issue a reboot of that node.
The reboot took about 8 minutes, and looking at the syslog that is because it was cleanly shutting down all the workloads for the reboot, after which everything sprang back to life just fine.
So it looks like 123 crashed and partially took out the node.
The node was recently updated to PVE 8.1.4 and the backup server to PBS 3.1.4, so we are not talking about anything old: everything was fully updated under a week ago, and all nodes have had a reboot to pick up the newly updated kernel.
I checked all the VMs and containers, and ONLY 123 shows a syslog gap (other than the reboot): it basically died at 1:01 am, and the reboot was at 7:33. There is nothing in the container's own logs; it just looks like it froze (which it also does when you back it up).
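(In case it helps anyone reproduce the check, this is roughly how I looked for the gap; the time window is from my case, and the second command assumes the CT runs systemd:)
Code:
# Host journal around the incident window:
journalctl --since "2024-03-13 00:30" --until "2024-03-13 08:00"
# Inside the CT, after the reboot brought it back:
pct exec 123 -- journalctl --since "2024-03-13 00:30" --until "2024-03-13 08:00"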
In the Proxmox syslog, even after the container froze, I can't see anything too unusual; the node carried on running, but there is an odd set of kernel messages:
Code:
2024-03-13T01:02:05.776254+00:00 proxmoxy3 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
2024-03-13T01:02:05.776819+00:00 proxmoxy3 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
2024-03-13T01:02:05.776882+00:00 proxmoxy3 systemd[1]: user-0.slice: Consumed 4.450s CPU time.
2024-03-13T01:02:06.679004+00:00 proxmoxy3 kernel: [388842.869608] Tainted: P O 6.5.13-1-pve #1
2024-03-13T01:02:06.679019+00:00 proxmoxy3 kernel: [388842.870281] Call Trace:
2024-03-13T01:02:06.679020+00:00 proxmoxy3 kernel: [388842.871114] ? __pfx_nfs_do_lookup_revalidate+0x10/0x10 [nfs]
2024-03-13T01:02:06.679021+00:00 proxmoxy3 kernel: [388842.871739] ? __pfx_var_wake_function+0x10/0x10
2024-03-13T01:02:06.679021+00:00 proxmoxy3 kernel: [388842.872684] filename_lookup+0xe4/0x200
2024-03-13T01:02:06.679023+00:00 proxmoxy3 kernel: [388842.872864] ? __pfx_zpl_put_link+0x10/0x10 [zfs]
2024-03-13T01:02:06.682810+00:00 proxmoxy3 kernel: [388842.873468] vfs_statx+0xa1/0x180
2024-03-13T01:02:06.682814+00:00 proxmoxy3 kernel: [388842.873645] vfs_fstatat+0x58/0x80
2024-03-13T01:02:06.682815+00:00 proxmoxy3 kernel: [388842.873819] __do_sys_newfstatat+0x44/0x90
2024-03-13T01:02:06.682815+00:00 proxmoxy3 kernel: [388842.874005] __x64_sys_newfstatat+0x1c/0x30
2024-03-13T01:02:06.682816+00:00 proxmoxy3 kernel: [388842.874532] ? exit_to_user_mode_prepare+0x39/0x190
2024-03-13T01:02:06.682817+00:00 proxmoxy3 kernel: [388842.874707] ? syscall_exit_to_user_mode+0x37/0x60
2024-03-13T01:04:07.511182+00:00 proxmoxy3 kernel: [388963.702989] Tainted: P O 6.5.13-1-pve #1
2024-03-13T01:04:07.511201+00:00 proxmoxy3 kernel: [388963.704802] <TASK>
2024-03-13T01:04:07.514838+00:00 proxmoxy3 kernel: [388963.707534] ? __pfx_var_wake_function+0x10/0x10
2024-03-13T01:04:07.518821+00:00 proxmoxy3 kernel: [388963.710719] ? strncpy_from_user+0x50/0x170
2024-03-13T01:04:07.518825+00:00 proxmoxy3 kernel: [388963.711123] vfs_statx+0xa1/0x180
2024-03-13T01:04:07.518826+00:00 proxmoxy3 kernel: [388963.711522] vfs_fstatat+0x58/0x80
2024-03-13T01:04:07.518827+00:00 proxmoxy3 kernel: [388963.711907] __do_sys_newfstatat+0x44/0x90
2024-03-13T01:04:07.518827+00:00 proxmoxy3 kernel: [388963.712699] do_syscall_64+0x58/0x90
2024-03-13T01:04:07.518829+00:00 proxmoxy3 kernel: [388963.713490] ? exit_to_user_mode_prepare+0x39/0x190
2024-03-13T01:04:07.518829+00:00 proxmoxy3 kernel: [388963.714289] ? do_syscall_64+0x67/0x90
2024-03-13T01:04:07.522873+00:00 proxmoxy3 kernel: [388963.714688] ? do_syscall_64+0x67/0x90
2024-03-13T01:04:07.522877+00:00 proxmoxy3 kernel: [388963.715082] ? do_syscall_64+0x67/0x90
2024-03-13T01:04:07.522877+00:00 proxmoxy3 kernel: [388963.715471] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
2024-03-13T01:04:07.522878+00:00 proxmoxy3 kernel: [388963.716683] RAX: ffffffffffffffda RBX: 00007ffc140329b0 RCX: 0000759d41e91d3e
2024-03-13T01:06:08.342827+00:00 proxmoxy3 kernel: [389084.535889] Tainted: P O 6.5.13-1-pve #1
2024-03-13T01:06:08.342845+00:00 proxmoxy3 kernel: [389084.536762] <TASK>
2024-03-13T01:06:08.342846+00:00 proxmoxy3 kernel: [389084.538335] lookup_fast+0x80/0x100
2024-03-13T01:06:08.342847+00:00 proxmoxy3 kernel: [389084.538870] filename_lookup+0xe4/0x200
2024-03-13T01:06:08.346809+00:00 proxmoxy3 kernel: [389084.540776] ? syscall_exit_to_user_mode+0x37/0x60
2024-03-13T01:06:08.346813+00:00 proxmoxy3 kernel: [389084.541280] ? do_syscall_64+0x67/0x90
2024-03-13T01:06:08.346814+00:00 proxmoxy3 kernel: [389084.542156] RDX: 00007ffc140328a0 RSI: 000057c8e7645b40 RDI: 00000000ffffff9c
2024-03-13T01:06:08.346814+00:00 proxmoxy3 kernel: [389084.542330] RBP: 000057c8e7643af0 R08: 000057c8e7646d90 R09: 0000000000000000
2024-03-13T01:06:08.346815+00:00 proxmoxy3 kernel: [389084.542862] </TASK>
2024-03-13T01:08:09.174886+00:00 proxmoxy3 kernel: [389205.368701] INFO: task php8.1:531710 blocked for more than 1087 seconds.
2024-03-13T01:08:09.174901+00:00 proxmoxy3 kernel: [389205.369014] Tainted: P O 6.5.13-1-pve #1
2024-03-13T01:08:09.174902+00:00 proxmoxy3 kernel: [389205.369255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-03-13T01:08:09.174902+00:00 proxmoxy3 kernel: [389205.369439] task:php8.1 state:D stack:0 pid:531710 ppid:22715 flags:0x00000006
2024-03-13T01:08:09.174903+00:00 proxmoxy3 kernel: [389205.369625] Call Trace:
2024-03-13T01:08:09.174904+00:00 proxmoxy3 kernel: [389205.369802] <TASK>
2024-03-13T01:08:09.174905+00:00 proxmoxy3 kernel: [389205.369976] __schedule+0x3fc/0x1440
2024-03-13T01:08:09.174906+00:00 proxmoxy3 kernel: [389205.370157] ? nfs_access_get_cached+0xd2/0x280 [nfs]
2024-03-13T01:08:09.174906+00:00 proxmoxy3 kernel: [389205.370359] ? __pfx_nfs_do_lookup_revalidate+0x10/0x10 [nfs]
2024-03-13T01:08:09.174907+00:00 proxmoxy3 kernel: [389205.370554] schedule+0x63/0x110
2024-03-13T01:08:09.174910+00:00 proxmoxy3 kernel: [389205.370725] __nfs_lookup_revalidate+0x107/0x140 [nfs]
2024-03-13T01:08:09.174920+00:00 proxmoxy3 kernel: [389205.370912] ? __pfx_var_wake_function+0x10/0x10
2024-03-13T01:08:09.174920+00:00 proxmoxy3 kernel: [389205.371080] nfs_lookup_revalidate+0x15/0x30 [nfs]
2024-03-13T01:08:09.174921+00:00 proxmoxy3 kernel: [389205.371260] lookup_fast+0x80/0x100
2024-03-13T01:08:09.174921+00:00 proxmoxy3 kernel: [389205.371425] walk_component+0x2c/0x190
2024-03-13T01:08:09.174921+00:00 proxmoxy3 kernel: [389205.371591] path_lookupat+0x67/0x1a0
2024-03-13T01:08:09.174923+00:00 proxmoxy3 kernel: [389205.371755] filename_lookup+0xe4/0x200
2024-03-13T01:08:09.174923+00:00 proxmoxy3 kernel: [389205.371918] ? __pfx_zpl_put_link+0x10/0x10 [zfs]
2024-03-13T01:08:09.174924+00:00 proxmoxy3 kernel: [389205.372250] ? strncpy_from_user+0x50/0x170
2024-03-13T01:08:09.174924+00:00 proxmoxy3 kernel: [389205.372413] vfs_statx+0xa1/0x180
2024-03-13T01:08:09.174924+00:00 proxmoxy3 kernel: [389205.372594] vfs_fstatat+0x58/0x80
2024-03-13T01:08:09.178810+00:00 proxmoxy3 kernel: [389205.373117] do_syscall_64+0x58/0x90
2024-03-13T01:08:09.178813+00:00 proxmoxy3 kernel: [389205.373816] ? do_syscall_64+0x67/0x90
2024-03-13T01:08:09.178814+00:00 proxmoxy3 kernel: [389205.374507] RIP: 0033:0x759d41e91d3e
2024-03-13T01:08:09.178814+00:00 proxmoxy3 kernel: [389205.374706] RSP: 002b:00007ffc14032808 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
2024-03-13T01:08:09.178815+00:00 proxmoxy3 kernel: [389205.375236] RBP: 000057c8e7643af0 R08: 000057c8e7646d90 R09: 0000000000000000
2024-03-13T01:08:09.178815+00:00 proxmoxy3 kernel: [389205.375770] </TASK>
2024-03-13T01:10:10.006843+00:00 proxmoxy3 kernel: [389326.202389] Tainted: P O 6.5.13-1-pve #1
2024-03-13T01:10:10.006857+00:00 proxmoxy3 kernel: [389326.204266] <TASK>
2024-03-13T01:10:10.014804+00:00 proxmoxy3 kernel: [389326.213244] ? exit_to_user_mode_prepare+0x39/0x190
2024-03-13T01:10:10.014808+00:00 proxmoxy3 kernel: [389326.213656] ? syscall_exit_to_user_mode+0x37/0x60
2024-03-13T01:10:10.019079+00:00 proxmoxy3 kernel: [389326.216617] RAX: ffffffffffffffda RBX: 00007ffc140329b0 RCX: 0000759d41e91d3e
2024-03-13T01:10:10.022811+00:00 proxmoxy3 kernel: [389326.218375] R13: 000057c8e7645b40 R14: 0000759d3fa15e10 R15: 000057c8e7645b40
2024-03-13T01:10:10.022814+00:00 proxmoxy3 kernel: [389326.218819] </TASK>
2024-03-13T01:15:04.051044+00:00 proxmoxy3 systemd[1]: Created slice user-0.slice - User Slice of UID 0.
2024-03-13T01:15:04.091042+00:00 proxmoxy3 systemd[1]: Starting user-runtime-dir@0.service - User Runtime Directory /run/user/0...
2024-03-13T01:15:04.096422+00:00 proxmoxy3 systemd[1]: Finished user-runtime-dir@0.service - User Runtime Directory /run/user/0.
2024-03-13T01:15:04.097741+00:00 proxmoxy3 systemd[1]: Starting user@0.service - User Manager for UID 0...
2024-03-13T01:15:04.324844+00:00 proxmoxy3 systemd[3882606]: Queued start job for default target default.target.
This looks like stack-trace information, but I could be barking up the wrong tree here. The CT in question runs PHP and has an NFS mount; the NFS mount is the reason it is a privileged container.
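(For completeness, this is roughly how I pulled those traces out of the journal; the "blocked for more than" marker comes from the hung-task report itself:)
Code:
# Kernel messages only, with context around each hung-task report:
journalctl -k --since "2024-03-13 01:00" --until "2024-03-13 01:15" | grep -B 3 -A 30 "blocked for more than"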
That container wasn't due to back up at 1 am, and that timestamp may just be the last thing MY PC saw before it went into standby.
In particular, I'd like to know whether I could have done something other than reboot the whole node, as everything except CT 123 was running OK.
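My untested guess at what I'd try next time before reaching for a node reboot, assuming a shell on the node is still reachable (e.g. via the GUI console):
Code:
# Try a clean stop of just the stuck CT, then a forced kill:
pct stop 123
lxc-stop -n 123 -k
# Check for processes stuck in uninterruptible sleep (state D), e.g. on a dead NFS mount:
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'
Though from what I understand, processes stuck in D state on an unresponsive NFS mount generally can't be killed, which may be exactly why only a reboot cleared it.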
The nodes replicate to each other via a dedicated replication network.
So... I don't have any other clues as to where to look. I actually don't think it was backing up; I think it just crashed and put the node in an unstable state.
If you can think of any logs I could check for more information, please let me know.
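(For reference, these are the places I've already looked, in case I've missed an obvious one; journalctl -b -1 assumes the journal is persistent:)
Code:
journalctl -b -1 -p warning     # previous boot, warnings and worse
less /var/log/syslog            # where the kernel traces above came from
pct config 123                  # the CT's config
cat /var/log/pve/tasks/active   # recent task history (backups, shutdowns, ...)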