Node with question mark

I'm having the same issue, and it's really annoying. I have already reinstalled Proxmox 3 times. I found this thread, but for me it is the pvestatd service that is at fault. After running for about 12 hours, this message shows up in the syslog (/var/log/syslog) once every second:

Jul 7 19:25:53 rocinante pvestatd[1505]: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Tools.pm line 949, <GEN1264905> chunk 1.

After a bit more time this message gets mixed with it:

Jul 7 19:26:03 rocinante kernel: [23150.449501] traps: pvestatd[20280] general protection ip:55c07c3cf856 sp:7ffd9899f4b0 error:0 in perl[55c07c2ee000+1e6000]

and at last this happens:

Jul 8 04:17:25 rocinante kernel: [55031.916102] pvestatd[1505]: segfault at 7f1dcfac0031 ip 000055c07c3dd32a sp 00007ffd9899f2c0 error 4 in perl[55c07c2ee000+1e6000]
Jul 8 04:17:25 rocinante systemd[1]: pvestatd.service: Main process exited, code=killed, status=11/SEGV
Jul 8 04:17:26 rocinante systemd[1]: pvestatd.service: Unit entered failed state.
Jul 8 04:17:26 rocinante systemd[1]: pvestatd.service: Failed with result 'signal'.

I noticed this because my external metrics server (InfluxDB) with Grafana doesn't get updated anymore when this happens, and I get alerts on my phone in the middle of the night...

For now, my workaround is to have systemd do this for the pvestatd service:
Restart=on-failure
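
For anyone wanting to try the same workaround, here is a minimal sketch of how that setting could be applied as a systemd drop-in instead of editing the packaged unit file. The drop-in file name restart.conf and the RestartSec=10 delay are assumptions for illustration, not something from this thread. Note that Restart=on-failure only helps when the process actually exits; it does not cover a daemon that hangs while still running.

```bash
#!/bin/bash
# Sketch: print a systemd drop-in that restarts pvestatd after a crash.
# RestartSec=10 is an assumed value; pick whatever delay suits you.
pvestatd_dropin() {
    cat <<'EOF'
[Service]
Restart=on-failure
RestartSec=10
EOF
}

# To install it (as root), then reload systemd:
#   mkdir -p /etc/systemd/system/pvestatd.service.d
#   pvestatd_dropin > /etc/systemd/system/pvestatd.service.d/restart.conf
#   systemctl daemon-reload
```

Running `systemctl edit pvestatd` and pasting the same two `[Service]` lines should give an equivalent result.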

Please mail me if you want my syslog, because it's bigger than 10,000 KB.

-Sammy

Edit:

So later that night the pvestatd service failed on me again, but this time there was nothing in the syslog, and systemd thought it was OK until I tried restarting pvestatd. (A new record! It died after only 3h 40min.)

Jul 9 14:37:29 rocinante pvestatd[19619]: start failed - can't aquire lock '/var/run/pvestatd.pid.lock' - Resource temporarily unavailable

I got it back up and running, but for how long?
This breaks my temporary fix. If anybody knows even a temporary solution, please tell me.

-Sammy
 
Same error again for the past month.
So far, I have only been able to fix a host (temporarily, as in for some days) by rebooting it.

Restarting the services as written above solves the problem only for a short time (30 minutes to 1 hour).

Any other ideas from the community?

My version is
pveversion
pve-manager/5.2-2/b1d1c7f4 (running kernel: 4.15.17-3-pve)
 
We had the same issue on a cluster and found a cause.
A failing DNS server caused it.

pvestatd serves stats to the Proxmox GUI. In our case we let pvestatd export metrics to Graphite.
At one point there was no DNS. pvestatd could not connect to Graphite, and that produced a lot of workers and the cosmetic question marks in the GUI.
None of the VMs were affected by this issue.
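
To rule this cause in or out, a rough sketch like the following could check whether the configured metric servers still resolve. It assumes the external metric servers are defined in /etc/pve/status.cfg with indented "server <host>" lines; the function names are made up for this example.

```bash
#!/bin/bash
# Sketch: list metric server hosts from a status.cfg-style file and check
# that each one resolves. Assumes "server <host>" lines under a
# graphite:/influxdb: section; adjust if your file looks different.
metric_servers() {
    awk '$1 == "server" { print $2 }' "$1"
}

check_metric_dns() {
    local host rc=0
    for host in $(metric_servers "$1"); do
        # getent ahosts goes through the normal resolver and also accepts
        # plain IP addresses.
        if getent ahosts "$host" > /dev/null; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage would be `check_metric_dns /etc/pve/status.cfg`; a FAIL line means pvestatd probably cannot reach that server by name either.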
 
I have the same problem, but my DNS server did not "fail".
Any other suggestions?
 
Same problem.
Restarting services won't help.
I have some containers on the "question mark" node, and I noticed that `lxc-ls` hangs.
Starting a container also hangs.
The bug is probably in the kernel or in the LXC tools.
 
I just encountered this symptom, then realized one of my LXC containers had 100% disk space usage (ZFS subvol). I was able to resize the disk in the web GUI, then the node & all running containers/VMs were restored to the normal green "play" indicator. No reboot was necessary.
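
A quick way to look for the same condition is to scan `df` output for filesystems that are nearly full. This is only a sketch: the 95% default threshold and the function name are arbitrary choices, and mount points containing spaces are not handled.

```bash
#!/bin/bash
# Sketch: print mount points at or above a usage threshold.
# Reads `df -P` output on stdin; the 95% default is an assumption.
full_filesystems() {
    local threshold="${1:-95}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign from "100%"
        if (use + 0 >= t) print $6, use "%"
    }'
}
```

On the host this would be `df -P | full_filesystems 95`; for a single container, something like `pct exec <vmid> -- df -P | full_filesystems` should show the view from inside it.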

Edit:
I'm adding more info here, in case there is another problem here that's masked by my apparent solution. Nothing that I could see was in syslog/journalctl. I checked dmesg and found the following:

Code:
[May30 01:09] CIFS VFS: Server <SMB host IP redacted> has not responded in 120 seconds. Reconnecting...
[  +0.010734] CIFS VFS: Free previous auth_key.response = <redacted>
[May30 01:26] INFO: task apache2:28526 blocked for more than 120 seconds.
[  +0.000823]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000661] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000676] apache2         D    0 28526  32642 0x00000100
[  +0.000004] Call Trace:
[  +0.000011]  __schedule+0x3e0/0x870
[  +0.000002]  schedule+0x36/0x80
[  +0.000003]  rwsem_down_read_failed+0x10a/0x170
[  +0.000005]  call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  ? call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  down_read+0x20/0x40
[  +0.000003]  lookup_slow+0x60/0x170
[  +0.000002]  ? lookup_fast+0xe8/0x300
[  +0.000001]  walk_component+0x1c5/0x360
[  +0.000002]  ? path_init+0x1bd/0x300
[  +0.000002]  path_lookupat+0x73/0x220
[  +0.000003]  ? profile_path_perm.part.7+0x78/0xa0
[  +0.000002]  filename_lookup+0xb8/0x1a0
[  +0.000004]  ? __check_object_size+0xb3/0x190
[  +0.000005]  ? strncpy_from_user+0x4d/0x170
[  +0.000002]  user_path_at_empty+0x36/0x40
[  +0.000001]  ? user_path_at_empty+0x36/0x40
[  +0.000004]  vfs_statx+0x76/0xe0
[  +0.000001]  ? memzero_explicit+0x12/0x20
[  +0.000002]  SYSC_newstat+0x3d/0x70
[  +0.000006]  ? __secure_computing+0x3f/0x100
[  +0.000004]  ? syscall_trace_enter+0xca/0x2e0
[  +0.000002]  SyS_newstat+0xe/0x10
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000003] RIP: 0033:0x7f7c88efd295
[  +0.000001] RSP: 002b:00007ffe736f7d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  +0.000002] RAX: ffffffffffffffda RBX: 00007ffe736f7de0 RCX: 00007f7c88efd295
[  +0.000001] RDX: 00007ffe736f7d50 RSI: 00007ffe736f7d50 RDI: 00007ffe736f7de0
[  +0.000001] RBP: 0000000000000002 R08: 000000000000c1de R09: 0000000000000005
[  +0.000001] R10: 00000000000006c0 R11: 0000000000000246 R12: 00007ffe736f8e00
[  +0.000001] R13: 000000000000000c R14: 00007ffe736f9040 R15: 00007f7c804244d0
[May30 01:54] INFO: task apache2:17245 blocked for more than 120 seconds.
[  +0.000780]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000689] apache2         D    0 17245  32642 0x00000100
[  +0.000004] Call Trace:
[  +0.000010]  __schedule+0x3e0/0x870
[  +0.000004]  ? path_parentat+0x3e/0x80
[  +0.000002]  schedule+0x36/0x80
[  +0.000003]  rwsem_down_write_failed+0x208/0x390
[  +0.000002]  ? getname_flags+0x4f/0x1f0
[  +0.000004]  call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  ? call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  down_write+0x2d/0x40
[  +0.000002]  do_unlinkat+0x1a5/0x310
[  +0.000002]  SyS_unlink+0x1f/0x30
[  +0.000004]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000003] RIP: 0033:0x7f7c88eff0e7
[  +0.000001] RSP: 002b:00007ffe736f8038 EFLAGS: 00000217 ORIG_RAX: 0000000000000057
[  +0.000002] RAX: ffffffffffffffda RBX: 00007f7c805fede0 RCX: 00007f7c88eff0e7
[  +0.000001] RDX: 000000000000001a RSI: 00007f7c59623cc8 RDI: 00007ffe736f8040
[  +0.000001] RBP: 00007ffe736f9100 R08: 000000000000c1de R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000217 R12: 00007f7c85f36460
[  +0.000002] R13: 0000000000000010 R14: 00007f7c804244f0 R15: 00007f7c5a929c58
[May30 03:09] INFO: task apache2:28426 blocked for more than 120 seconds.
[  +0.000764]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000909] apache2         D    0 28426  32642 0x00000100
[  +0.000003] Call Trace:
[  +0.000011]  __schedule+0x3e0/0x870
[  +0.000004]  ? path_parentat+0x3e/0x80
[  +0.000002]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_write_failed+0x208/0x390
[  +0.000002]  ? getname_flags+0x4f/0x1f0
[  +0.000005]  call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  ? call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  down_write+0x2d/0x40
[  +0.000002]  do_unlinkat+0x1a5/0x310
[  +0.000002]  SyS_unlink+0x1f/0x30
[  +0.000004]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000003] RIP: 0033:0x7f7c88eff0e7
[  +0.000001] RSP: 002b:00007ffe736f8038 EFLAGS: 00000217 ORIG_RAX: 0000000000000057
[  +0.000002] RAX: ffffffffffffffda RBX: 00007f7c805fede0 RCX: 00007f7c88eff0e7
[  +0.000001] RDX: 0000000000000000 RSI: 00007f7c58423cc8 RDI: 00007ffe736f8040
[  +0.000001] RBP: 00007ffe736f9100 R08: 000000000000c1de R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000217 R12: 00007f7c85f36460
[  +0.000001] R13: 0000000000000010 R14: 00007f7c804244f0 R15: 00007f7c5a929c58
[May30 03:21] INFO: task apache2:9497 blocked for more than 120 seconds.
[  +0.000793]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000745] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000763] apache2         D    0  9497  32642 0x00000100
[  +0.000003] Call Trace:
[  +0.000011]  __schedule+0x3e0/0x870
[  +0.000002]  schedule+0x36/0x80
[  +0.000003]  rwsem_down_read_failed+0x10a/0x170
[  +0.000005]  call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  ? call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  down_read+0x20/0x40
[  +0.000003]  lookup_slow+0x60/0x170
[  +0.000001]  ? lookup_fast+0xe8/0x300
[  +0.000002]  walk_component+0x1c5/0x360
[  +0.000002]  ? path_init+0x1bd/0x300
[  +0.000001]  path_lookupat+0x73/0x220
[  +0.000004]  ? profile_path_perm.part.7+0x78/0xa0
[  +0.000002]  filename_lookup+0xb8/0x1a0
[  +0.000004]  ? __check_object_size+0xb3/0x190
[  +0.000004]  ? strncpy_from_user+0x4d/0x170
[  +0.000002]  user_path_at_empty+0x36/0x40
[  +0.000002]  ? user_path_at_empty+0x36/0x40
[  +0.000003]  vfs_statx+0x76/0xe0
[  +0.000001]  ? memzero_explicit+0x12/0x20
[  +0.000002]  SYSC_newstat+0x3d/0x70
[  +0.000006]  ? __secure_computing+0x3f/0x100
[  +0.000004]  ? syscall_trace_enter+0xca/0x2e0
[  +0.000002]  SyS_newstat+0xe/0x10
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000002] RIP: 0033:0x7f7c88efd295
[  +0.000001] RSP: 002b:00007ffe736f7d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  +0.000002] RAX: ffffffffffffffda RBX: 00007ffe736f7de0 RCX: 00007f7c88efd295
[  +0.000001] RDX: 00007ffe736f7d50 RSI: 00007ffe736f7d50 RDI: 00007ffe736f7de0
[  +0.000001] RBP: 0000000000000002 R08: 000000000000c1de R09: 0000000000000005
[  +0.000001] R10: 00000000000001f8 R11: 0000000000000246 R12: 00007ffe736f8e00
[  +0.000002] R13: 000000000000000c R14: 00007ffe736f9040 R15: 00007f7c804244d0
[May30 03:41] INFO: task apache2:11059 blocked for more than 120 seconds.
[  +0.000815]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000767] apache2         D    0 11059  32642 0x00000100
[  +0.000004] Call Trace:
[  +0.000011]  __schedule+0x3e0/0x870
[  +0.000003]  ? path_parentat+0x3e/0x80
[  +0.000002]  schedule+0x36/0x80
[  +0.000003]  rwsem_down_write_failed+0x208/0x390
[  +0.000005]  call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  ? call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  down_write+0x2d/0x40
[  +0.000003]  do_unlinkat+0x1a5/0x310
[  +0.000002]  SyS_unlink+0x1f/0x30
[  +0.000004]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000003] RIP: 0033:0x7f7c88eff0e7
[  +0.000001] RSP: 002b:00007ffe736f8038 EFLAGS: 00000217 ORIG_RAX: 0000000000000057
[  +0.000002] RAX: ffffffffffffffda RBX: 00007f7c805fede0 RCX: 00007f7c88eff0e7
[  +0.000001] RDX: 0000000000000000 RSI: 00007f7c45223cc8 RDI: 00007ffe736f8040
[  +0.000001] RBP: 00007ffe736f9100 R08: 000000000000c1de R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000217 R12: 00007f7c85f36460
[  +0.000001] R13: 0000000000000010 R14: 00007f7c804254f0 R15: 00007f7c5a929c58
[  +0.000005] INFO: task apache2:28381 blocked for more than 120 seconds.
[  +0.000761]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000776] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000767] apache2         D    0 28381  32642 0x00000100
[  +0.000003] Call Trace:
[  +0.000003]  __schedule+0x3e0/0x870
[  +0.000002]  ? path_parentat+0x3e/0x80
[  +0.000002]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_write_failed+0x208/0x390
[  +0.000004]  call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  ? call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  down_write+0x2d/0x40
[  +0.000001]  do_unlinkat+0x1a5/0x310
[  +0.000002]  SyS_unlink+0x1f/0x30
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000001] RIP: 0033:0x7f7c88eff0e7
[  +0.000001] RSP: 002b:00007ffe736f8038 EFLAGS: 00000217 ORIG_RAX: 0000000000000057
[  +0.000001] RAX: ffffffffffffffda RBX: 00007f7c805fede0 RCX: 00007f7c88eff0e7
[  +0.000002] RDX: 0000000000000000 RSI: 00007f7c45023cc8 RDI: 00007ffe736f8040
[  +0.000001] RBP: 00007ffe736f9100 R08: 000000000000c1de R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000217 R12: 00007f7c85f36460
[  +0.000001] R13: 0000000000000010 R14: 00007f7c804254f0 R15: 00007f7c5a929c58
[  +0.000002] INFO: task apache2:28383 blocked for more than 120 seconds.
[  +0.000760]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000745] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000774] apache2         D    0 28383  32642 0x00000100
[  +0.000002] Call Trace:
[  +0.000003]  __schedule+0x3e0/0x870
[  +0.000011]  ? spl_kmem_cache_alloc+0x72/0x8c0 [spl]
[  +0.000002]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_read_failed+0x10a/0x170
[  +0.000003]  call_rwsem_down_read_failed+0x18/0x30
[  +0.000001]  ? call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  down_read+0x20/0x40
[  +0.000002]  lookup_slow+0x60/0x170
[  +0.000001]  ? lookup_fast+0xe8/0x300
[  +0.000002]  walk_component+0x1c5/0x360
[  +0.000002]  ? path_init+0x1bd/0x300
[  +0.000002]  path_lookupat+0x73/0x220
[  +0.000002]  ? profile_path_perm.part.7+0x78/0xa0
[  +0.000002]  filename_lookup+0xb8/0x1a0
[  +0.000003]  ? __check_object_size+0xb3/0x190
[  +0.000004]  ? strncpy_from_user+0x4d/0x170
[  +0.000002]  user_path_at_empty+0x36/0x40
[  +0.000002]  ? user_path_at_empty+0x36/0x40
[  +0.000003]  vfs_statx+0x76/0xe0
[  +0.000002]  ? memzero_explicit+0x12/0x20
[  +0.000002]  SYSC_newstat+0x3d/0x70
[  +0.000005]  ? __secure_computing+0x3f/0x100
[  +0.000002]  ? syscall_trace_enter+0xca/0x2e0
[  +0.000003]  SyS_newstat+0xe/0x10
[  +0.000001]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000001] RIP: 0033:0x7f7c88efd295
[  +0.000001] RSP: 002b:00007ffe736f7d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  +0.000001] RAX: ffffffffffffffda RBX: 00007ffe736f7de0 RCX: 00007f7c88efd295
[  +0.000001] RDX: 00007ffe736f7d50 RSI: 00007ffe736f7d50 RDI: 00007ffe736f7de0
[  +0.000002] RBP: 0000000000000002 R08: 000000000000c1de R09: 0000000000000005
[  +0.000001] R10: 0000000000000140 R11: 0000000000000246 R12: 00007ffe736f8e00
[  +0.000001] R13: 000000000000000c R14: 00007ffe736f9040 R15: 00007f7c804254d0
[  +0.000007] INFO: task apache2:7623 blocked for more than 120 seconds.
[  +0.000804]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000803] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000809] apache2         D    0  7623  32642 0x00000100
[  +0.000002] Call Trace:
[  +0.000004]  __schedule+0x3e0/0x870
[  +0.000002]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_read_failed+0x10a/0x170
[  +0.000002]  call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  ? call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  down_read+0x20/0x40
[  +0.000002]  lookup_slow+0x60/0x170
[  +0.000001]  ? lookup_fast+0xe8/0x300
[  +0.000002]  walk_component+0x1c5/0x360
[  +0.000002]  ? path_init+0x1bd/0x300
[  +0.000001]  path_lookupat+0x73/0x220
[  +0.000002]  ? profile_path_perm.part.7+0x78/0xa0
[  +0.000003]  filename_lookup+0xb8/0x1a0
[  +0.000003]  ? __check_object_size+0xb3/0x190
[  +0.000002]  ? strncpy_from_user+0x4d/0x170
[  +0.000002]  user_path_at_empty+0x36/0x40
[  +0.000001]  ? user_path_at_empty+0x36/0x40
[  +0.000002]  vfs_statx+0x76/0xe0
[  +0.000002]  ? memzero_explicit+0x12/0x20
[  +0.000002]  SYSC_newstat+0x3d/0x70
[  +0.000002]  ? __secure_computing+0x3f/0x100
[  +0.000002]  ? syscall_trace_enter+0xca/0x2e0
[  +0.000003]  SyS_newstat+0xe/0x10
[  +0.000001]  do_syscall_64+0x73/0x130
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000002] RIP: 0033:0x7f7c88efd295
[  +0.000000] RSP: 002b:00007ffe736f7d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  +0.000002] RAX: ffffffffffffffda RBX: 00007ffe736f7de0 RCX: 00007f7c88efd295
[  +0.000001] RDX: 00007ffe736f7d50 RSI: 00007ffe736f7d50 RDI: 00007ffe736f7de0
[  +0.000001] RBP: 0000000000000002 R08: 000000000000c1de R09: 0000000000000005
[  +0.000001] R10: 00000000000004c8 R11: 0000000000000246 R12: 00007ffe736f8e00
[  +0.000001] R13: 000000000000000c R14: 00007ffe736f9040 R15: 00007f7c804254d0
[  +0.000003] INFO: task apache2:9497 blocked for more than 120 seconds.
[  +0.000825]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000870] apache2         D    0  9497  32642 0x00000100
[  +0.000003] Call Trace:
[  +0.000003]  __schedule+0x3e0/0x870
[  +0.000002]  ? path_parentat+0x3e/0x80
[  +0.000001]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_write_failed+0x208/0x390
[  +0.000002]  ? getname_flags+0x4f/0x1f0
[  +0.000003]  call_rwsem_down_write_failed+0x17/0x30
[  +0.000001]  ? call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  down_write+0x2d/0x40
[  +0.000002]  do_unlinkat+0x1a5/0x310
[  +0.000002]  SyS_unlink+0x1f/0x30
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000001] RIP: 0033:0x7f7c88eff0e7
[  +0.000001] RSP: 002b:00007ffe736f8038 EFLAGS: 00000217 ORIG_RAX: 0000000000000057
[  +0.000002] RAX: ffffffffffffffda RBX: 00007f7c805fede0 RCX: 00007f7c88eff0e7
[  +0.000001] RDX: 0000000000000000 RSI: 00007f7c59a23cc8 RDI: 00007ffe736f8040
[  +0.000001] RBP: 00007ffe736f9100 R08: 000000000000c1de R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000217 R12: 00007f7c85f36460
[  +0.000002] R13: 0000000000000010 R14: 00007f7c804244f0 R15: 00007f7c5a929c58
[  +0.000006] INFO: task apache2:20033 blocked for more than 120 seconds.
[  +0.000857]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000893] apache2         D    0 20033  32642 0x00000100
[  +0.000002] Call Trace:
[  +0.000003]  __schedule+0x3e0/0x870
[  +0.000002]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_read_failed+0x10a/0x170
[  +0.000002]  call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  ? call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  down_read+0x20/0x40
[  +0.000002]  lookup_slow+0x60/0x170
[  +0.000001]  ? lookup_fast+0xe8/0x300
[  +0.000002]  walk_component+0x1c5/0x360
[  +0.000001]  ? path_init+0x1bd/0x300
[  +0.000002]  path_lookupat+0x73/0x220
[  +0.000002]  ? profile_path_perm.part.7+0x78/0xa0
[  +0.000002]  filename_lookup+0xb8/0x1a0
[  +0.000002]  ? __check_object_size+0xb3/0x190
[  +0.000003]  ? strncpy_from_user+0x4d/0x170
[  +0.000001]  user_path_at_empty+0x36/0x40
[  +0.000002]  ? user_path_at_empty+0x36/0x40
[  +0.000002]  vfs_statx+0x76/0xe0
[  +0.000001]  ? memzero_explicit+0x12/0x20
[  +0.000002]  SYSC_newstat+0x3d/0x70
[  +0.000003]  ? __secure_computing+0x3f/0x100
[  +0.000002]  ? syscall_trace_enter+0xca/0x2e0
[  +0.000002]  SyS_newstat+0xe/0x10
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000001] RIP: 0033:0x7f7c88efd295
[  +0.000001] RSP: 002b:00007ffe736f7d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  +0.000001] RAX: ffffffffffffffda RBX: 00007ffe736f7de0 RCX: 00007f7c88efd295
[  +0.000001] RDX: 00007ffe736f7d50 RSI: 00007ffe736f7d50 RDI: 00007ffe736f7de0
[  +0.000001] RBP: 0000000000000002 R08: 000000000000c1de R09: 0000000000000005
[  +0.000001] R10: 0000000000000478 R11: 0000000000000246 R12: 00007ffe736f8e00
[  +0.000001] R13: 000000000000000c R14: 00007ffe736f9040 R15: 00007f7c804254d0
[May30 04:17] EXT4-fs (loop2): mounted filesystem with ordered data mode. Opts: (null)
[May30 09:06] EXT4-fs (loop1): error count since last fsck: 1
[  +0.000016] EXT4-fs (loop1): initial error at time 1559037951: kmmpd:178
[  +0.000004] EXT4-fs (loop1): last error at time 1559037951: kmmpd:178

The PVE stats are blank between 3AM and 9AM (when I increased the disk size). For some more info, the offending container was running nextcloud, and I was generating image previews overnight. So, it's not really surprising that disk space blew up, but the response of Proxmox was a little concerning.
 
In short, no. Basically, you need to restart every single node to solve the problem. I believe it happens when one node carries very large transfers for so long that corosync communication on that node is disrupted. At that point, only this node should show a question mark. However, a bug in corosync 2.4.2 (fixed in 2.4.3) might be what brought down the whole cluster. I filed a bug report with Proxmox earlier, and the devs said they plan to upgrade corosync to 2.4.4 "soon".

I'm not entirely sure that bug is the cause of the problem, but it has to be something with corosync. So I guess we might just want to wait for the 2.4.4 update.

BTW: I now manually limit transfers, e.g. to 95 Mbps on bottleneck servers, and the problem is now rare (it happened a few times but self-healed quickly).
With 2.4.4 I have the same problem.
 
I have the same problem. It is easy to reproduce: when my NFS backup server does not respond, I need to restart all pvestatd instances every few minutes...
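
If an unresponsive network storage is the suspect, a small probe like this can show which mount is hanging. It is a sketch only: the function name and the 5 second default are assumptions, and a process stuck hard enough in the kernel may not be killable even with SIGKILL.

```bash
#!/bin/bash
# Sketch: test whether a path answers a stat() within a few seconds.
# A hung NFS/CIFS mount typically blocks here instead of returning.
mount_responds() {
    local path="$1" secs="${2:-5}"
    timeout -s KILL "$secs" stat -t -- "$path" > /dev/null 2>&1
}

# Example loop, assuming the storages are mounted under /mnt/pve:
#   for m in /mnt/pve/*; do
#       mount_responds "$m" || echo "not responding: $m"
#   done
```

`pvesm status` reports storage state as well, but it can itself block on a dead mount, which is why the probe above uses a hard timeout.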
 
Hi,
I had this problem with one node in the cluster. It happened when we stopped a backup task and then tried to start an LXC container.
This process was left as a zombie on the node:
\_ /usr/bin/perl /usr/share/lxc/hooks/lxc-pve-prestart-hook 104 lxc pre-start
I killed the process and now everything is working.
Thank you all.
These posts helped me:
https://forum.proxmox.com/threads/c...-a-strange-state-containers-greyed-out.46650/
https://bugzilla.proxmox.com/show_bug.cgi?id=1943
https://forum.proxmox.com/threads/lxc-vm-startup-fails-in-lxc-pve-prestart-hook.45190/
 
Happened to me too, once again. pvestatd should really be a multithreaded service so that it would not bog itself down when one of the metrics is not responding. In my case I saw that the `vgs` command hung for some reason.
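
To spot a hung command like that, it can help to list processes sitting in uninterruptible sleep (state D), which is also the state the blocked apache2 tasks in the dmesg output earlier in this thread were in. A minimal sketch, with a made-up function name, that filters `ps` output:

```bash
#!/bin/bash
# Sketch: keep only processes whose state starts with D (uninterruptible
# sleep). Expects the column order pid,stat,etime,args on stdin and
# skips the header line.
blocked_processes() {
    awk 'NR > 1 && $2 ~ /^D/'
}
```

Usage would be `ps -eo pid,stat,etime,args | blocked_processes`. A process that stays in this list for minutes is usually waiting on storage or the kernel, not on CPU.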
 
Same error here, about once a week :(
Mine is doing that too. For some reason pvestatd is crashing, which causes the issue. I'm still researching the cause.

I wrote the following script to keep the daemon alive and put it in crontab. It's kind of like a watchdog, but without having to configure that beast.

Bash:
#!/bin/bash
#  Keep the PVE STAT DAEMON running, to prevent the grey question mark in the UI
#############################################################

# Check status of the daemon
STATUS=$(pvestatd status)

if [[ "running" != "${STATUS}" ]]; then
   pvestatd start
fi

and the crontab looks like
Code:
*/15 * * * * /root/bin/keep-alive-pvestatd
 
Sadly, for me it was a real hardware issue.
 
I have the same problem! My clusters now crash almost every 2 weeks for no apparent reason, and the only way to resolve it is to reboot the servers. Can anyone help?
 
I have seen the same problem. After trying out a few things from this thread, I thought it might have something to do with some storage that locks up.
So I tried restarting the Proxmox Backup Server. Just after I clicked on Start, everything went back to normal.
I have no cluster, just a single node, but the backup server is also configured on a second node, which showed the same problem today.
At first I thought it might be related to the "Update package database" task that runs every night, because the graphs in the interface stop at exactly the time it gets executed (4 AM on one node, 5:30 AM on the other).
So if you use Proxmox Backup Server, you can try restarting it. It may (or may not) help. :)
 
