Node with question mark

speatzle_

Member
Nov 4, 2017
So I'm having the same issue, which is really annoying. I have already reinstalled Proxmox 3 times. I found this thread, but for me it's the pvestatd service that is at fault. After running for about 12 hours, this message shows up in the syslog (/var/log/syslog) once every second:

Jul 7 19:25:53 rocinante pvestatd[1505]: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Tools.pm line 949, <GEN1264905> chunk 1.

After a bit more time this message gets mixed with it:

Jul 7 19:26:03 rocinante kernel: [23150.449501] traps: pvestatd[20280] general protection ip:55c07c3cf856 sp:7ffd9899f4b0 error:0 in perl[55c07c2ee000+1e6000]

and at last this happens:

Jul 8 04:17:25 rocinante kernel: [55031.916102] pvestatd[1505]: segfault at 7f1dcfac0031 ip 000055c07c3dd32a sp 00007ffd9899f2c0 error 4 in perl[55c07c2ee000+1e6000]
Jul 8 04:17:25 rocinante systemd[1]: pvestatd.service: Main process exited, code=killed, status=11/SEGV
Jul 8 04:17:26 rocinante systemd[1]: pvestatd.service: Unit entered failed state.
Jul 8 04:17:26 rocinante systemd[1]: pvestatd.service: Failed with result 'signal'.

I noticed this because my external metrics server (InfluxDB) with Grafana doesn't get updated anymore when this happens, and I get alerts on my phone in the middle of the night...

For now my workaround is to have systemd restart the pvestatd service when it fails, by setting this for the service:
Restart=on-failure
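
A minimal sketch of how that setting could be applied as a systemd drop-in instead of editing the packaged unit file. The drop-in path follows the usual systemd convention for per-unit overrides; the `RestartSec=10` delay and the helper names `render_dropin` and `write_dropin` are assumptions for illustration, not part of the setup described above.

```python
from pathlib import Path

# Conventional drop-in location for overriding a packaged unit (assumed here).
DROPIN_PATH = Path("/etc/systemd/system/pvestatd.service.d/override.conf")


def render_dropin(restart_sec: int = 10) -> str:
    """Return drop-in text that restarts the service after a failure.

    restart_sec is an illustrative delay, not a value from the thread.
    """
    return (
        "[Service]\n"
        "Restart=on-failure\n"
        f"RestartSec={restart_sec}\n"
    )


def write_dropin(path: Path = DROPIN_PATH, restart_sec: int = 10) -> None:
    """Write the drop-in; needs root and a later `systemctl daemon-reload`."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_dropin(restart_sec))


if __name__ == "__main__":
    # Print only, so nothing on the host is changed by running the sketch.
    print(render_dropin(), end="")
```

Keep in mind that `Restart=on-failure` only reacts when the process actually exits uncleanly (for example when it is killed by a signal); a process that hangs while still running is not restarted by it.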

Please mail me if you want my syslog, because it's bigger than 10,000 KB.

-Sammy

Edit:

So later that night the pvestatd service failed on me again, but this time there was nothing in the syslog, and systemd thought it was OK until I tried restarting pvestatd. (A new record! It died after only 3 h 40 min.)

Jul 9 14:37:29 rocinante pvestatd[19619]: start failed - can't aquire lock '/var/run/pvestatd.pid.lock' - Resource temporarily unavailable

I got it back up and running, but for how long?
This breaks my temporary fix. If anybody knows even a temporary solution, please tell me.

-Sammy
 

A71

New Member
Feb 22, 2018
Same error again, for the past month.
So far I have only been able to fix a host temporarily (for a few days) by rebooting it.

Restarting the services as described above solves the problem only for a short time (30 minutes to 1 hour).

Any other ideas from the community?

My version is
pveversion
pve-manager/5.2-2/b1d1c7f4 (running kernel: 4.15.17-3-pve)
 
Jan 3, 2014
Ede, NL
www.tuxis.nl
We had the same issue on a cluster and found a cause.
A failing DNS server caused it.

pvestatd serves stats to the Proxmox GUI. In our case we let pvestatd export metrics to Graphite.
At one point there was no DNS. pvestatd could not connect to Graphite, and that caused a lot of workers and the cosmetic question marks in the GUI.
None of the VMs were affected by this issue.
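
A quick way to rule this cause in or out is to check whether the metrics server's host name still resolves from the affected node. The following is only a sketch: the function name `can_resolve` and the injectable `resolver` parameter are illustrative assumptions, and it checks name resolution only, not whether the metrics server itself is reachable.

```python
import socket
from typing import Callable


def can_resolve(host: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if the host name resolves, False on a resolver error."""
    try:
        # The port is irrelevant for a pure name lookup, so None is passed.
        resolver(host, None)
        return True
    except socket.gaierror:
        # Raised by getaddrinfo when the name cannot be resolved.
        return False
```

Calling `can_resolve()` with the host name configured for the external metrics server, on the node that shows the question mark, tells you whether a failed lookup could be involved. The `resolver` argument exists only so the check can be exercised without real DNS.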
 

chchang

New Member
Feb 6, 2018
I have the same problem, but my DNS server did not "fail".
Any other suggestions?
 
Mar 22, 2018
Same problem.
Restarting services won't help.
I have some containers on the "question mark" node, and I noticed that `lxc-ls` hangs.
Starting a container also hangs.
The bug is probably in the kernel or in the LXC tools.
 

Elliott Partridge

New Member
Oct 7, 2018
I just encountered this symptom, then realized one of my LXC containers had 100% disk space usage (ZFS subvol). I was able to resize the disk in the web GUI, then the node & all running containers/VMs were restored to the normal green "play" indicator. No reboot was necessary.
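
For anyone who wants to catch this before the node goes grey, here is a small sketch of a fill-level check that could be run inside a container or on the host. The 95% threshold and the function names are assumptions chosen for illustration; `shutil.disk_usage` reports the filesystem that holds the given path, and its numbers can differ slightly from what `df` shows.

```python
import shutil


def usage_fraction(total: int, used: int) -> float:
    """Fraction of the filesystem in use; 0.0 when the total is unknown."""
    return used / total if total > 0 else 0.0


def is_nearly_full(path: str, threshold: float = 0.95) -> bool:
    """True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)
    return usage_fraction(usage.total, usage.used) >= threshold


if __name__ == "__main__":
    # "/" is just an example; inside a container this checks its root volume.
    print("nearly full" if is_nearly_full("/") else "ok")
```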

Edit:
I'm adding more info here, in case there is another problem here that's masked by my apparent solution. Nothing that I could see was in syslog/journalctl. I checked dmesg and found the following:

Code:
[May30 01:09] CIFS VFS: Server <SMB host IP redacted> has not responded in 120 seconds. Reconnecting...
[  +0.010734] CIFS VFS: Free previous auth_key.response = <redacted>
[May30 01:26] INFO: task apache2:28526 blocked for more than 120 seconds.
[  +0.000823]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000661] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000676] apache2         D    0 28526  32642 0x00000100
[  +0.000004] Call Trace:
[  +0.000011]  __schedule+0x3e0/0x870
[  +0.000002]  schedule+0x36/0x80
[  +0.000003]  rwsem_down_read_failed+0x10a/0x170
[  +0.000005]  call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  ? call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  down_read+0x20/0x40
[  +0.000003]  lookup_slow+0x60/0x170
[  +0.000002]  ? lookup_fast+0xe8/0x300
[  +0.000001]  walk_component+0x1c5/0x360
[  +0.000002]  ? path_init+0x1bd/0x300
[  +0.000002]  path_lookupat+0x73/0x220
[  +0.000003]  ? profile_path_perm.part.7+0x78/0xa0
[  +0.000002]  filename_lookup+0xb8/0x1a0
[  +0.000004]  ? __check_object_size+0xb3/0x190
[  +0.000005]  ? strncpy_from_user+0x4d/0x170
[  +0.000002]  user_path_at_empty+0x36/0x40
[  +0.000001]  ? user_path_at_empty+0x36/0x40
[  +0.000004]  vfs_statx+0x76/0xe0
[  +0.000001]  ? memzero_explicit+0x12/0x20
[  +0.000002]  SYSC_newstat+0x3d/0x70
[  +0.000006]  ? __secure_computing+0x3f/0x100
[  +0.000004]  ? syscall_trace_enter+0xca/0x2e0
[  +0.000002]  SyS_newstat+0xe/0x10
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000003] RIP: 0033:0x7f7c88efd295
[  +0.000001] RSP: 002b:00007ffe736f7d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  +0.000002] RAX: ffffffffffffffda RBX: 00007ffe736f7de0 RCX: 00007f7c88efd295
[  +0.000001] RDX: 00007ffe736f7d50 RSI: 00007ffe736f7d50 RDI: 00007ffe736f7de0
[  +0.000001] RBP: 0000000000000002 R08: 000000000000c1de R09: 0000000000000005
[  +0.000001] R10: 00000000000006c0 R11: 0000000000000246 R12: 00007ffe736f8e00
[  +0.000001] R13: 000000000000000c R14: 00007ffe736f9040 R15: 00007f7c804244d0
[May30 01:54] INFO: task apache2:17245 blocked for more than 120 seconds.
[  +0.000780]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000689] apache2         D    0 17245  32642 0x00000100
[  +0.000004] Call Trace:
[  +0.000010]  __schedule+0x3e0/0x870
[  +0.000004]  ? path_parentat+0x3e/0x80
[  +0.000002]  schedule+0x36/0x80
[  +0.000003]  rwsem_down_write_failed+0x208/0x390
[  +0.000002]  ? getname_flags+0x4f/0x1f0
[  +0.000004]  call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  ? call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  down_write+0x2d/0x40
[  +0.000002]  do_unlinkat+0x1a5/0x310
[  +0.000002]  SyS_unlink+0x1f/0x30
[  +0.000004]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000003] RIP: 0033:0x7f7c88eff0e7
[  +0.000001] RSP: 002b:00007ffe736f8038 EFLAGS: 00000217 ORIG_RAX: 0000000000000057
[  +0.000002] RAX: ffffffffffffffda RBX: 00007f7c805fede0 RCX: 00007f7c88eff0e7
[  +0.000001] RDX: 000000000000001a RSI: 00007f7c59623cc8 RDI: 00007ffe736f8040
[  +0.000001] RBP: 00007ffe736f9100 R08: 000000000000c1de R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000217 R12: 00007f7c85f36460
[  +0.000002] R13: 0000000000000010 R14: 00007f7c804244f0 R15: 00007f7c5a929c58
[May30 03:09] INFO: task apache2:28426 blocked for more than 120 seconds.
[  +0.000764]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000909] apache2         D    0 28426  32642 0x00000100
[  +0.000003] Call Trace:
[  +0.000011]  __schedule+0x3e0/0x870
[  +0.000004]  ? path_parentat+0x3e/0x80
[  +0.000002]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_write_failed+0x208/0x390
[  +0.000002]  ? getname_flags+0x4f/0x1f0
[  +0.000005]  call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  ? call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  down_write+0x2d/0x40
[  +0.000002]  do_unlinkat+0x1a5/0x310
[  +0.000002]  SyS_unlink+0x1f/0x30
[  +0.000004]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000003] RIP: 0033:0x7f7c88eff0e7
[  +0.000001] RSP: 002b:00007ffe736f8038 EFLAGS: 00000217 ORIG_RAX: 0000000000000057
[  +0.000002] RAX: ffffffffffffffda RBX: 00007f7c805fede0 RCX: 00007f7c88eff0e7
[  +0.000001] RDX: 0000000000000000 RSI: 00007f7c58423cc8 RDI: 00007ffe736f8040
[  +0.000001] RBP: 00007ffe736f9100 R08: 000000000000c1de R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000217 R12: 00007f7c85f36460
[  +0.000001] R13: 0000000000000010 R14: 00007f7c804244f0 R15: 00007f7c5a929c58
[May30 03:21] INFO: task apache2:9497 blocked for more than 120 seconds.
[  +0.000793]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000745] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000763] apache2         D    0  9497  32642 0x00000100
[  +0.000003] Call Trace:
[  +0.000011]  __schedule+0x3e0/0x870
[  +0.000002]  schedule+0x36/0x80
[  +0.000003]  rwsem_down_read_failed+0x10a/0x170
[  +0.000005]  call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  ? call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  down_read+0x20/0x40
[  +0.000003]  lookup_slow+0x60/0x170
[  +0.000001]  ? lookup_fast+0xe8/0x300
[  +0.000002]  walk_component+0x1c5/0x360
[  +0.000002]  ? path_init+0x1bd/0x300
[  +0.000001]  path_lookupat+0x73/0x220
[  +0.000004]  ? profile_path_perm.part.7+0x78/0xa0
[  +0.000002]  filename_lookup+0xb8/0x1a0
[  +0.000004]  ? __check_object_size+0xb3/0x190
[  +0.000004]  ? strncpy_from_user+0x4d/0x170
[  +0.000002]  user_path_at_empty+0x36/0x40
[  +0.000002]  ? user_path_at_empty+0x36/0x40
[  +0.000003]  vfs_statx+0x76/0xe0
[  +0.000001]  ? memzero_explicit+0x12/0x20
[  +0.000002]  SYSC_newstat+0x3d/0x70
[  +0.000006]  ? __secure_computing+0x3f/0x100
[  +0.000004]  ? syscall_trace_enter+0xca/0x2e0
[  +0.000002]  SyS_newstat+0xe/0x10
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000002] RIP: 0033:0x7f7c88efd295
[  +0.000001] RSP: 002b:00007ffe736f7d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  +0.000002] RAX: ffffffffffffffda RBX: 00007ffe736f7de0 RCX: 00007f7c88efd295
[  +0.000001] RDX: 00007ffe736f7d50 RSI: 00007ffe736f7d50 RDI: 00007ffe736f7de0
[  +0.000001] RBP: 0000000000000002 R08: 000000000000c1de R09: 0000000000000005
[  +0.000001] R10: 00000000000001f8 R11: 0000000000000246 R12: 00007ffe736f8e00
[  +0.000002] R13: 000000000000000c R14: 00007ffe736f9040 R15: 00007f7c804244d0
[May30 03:41] INFO: task apache2:11059 blocked for more than 120 seconds.
[  +0.000815]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000767] apache2         D    0 11059  32642 0x00000100
[  +0.000004] Call Trace:
[  +0.000011]  __schedule+0x3e0/0x870
[  +0.000003]  ? path_parentat+0x3e/0x80
[  +0.000002]  schedule+0x36/0x80
[  +0.000003]  rwsem_down_write_failed+0x208/0x390
[  +0.000005]  call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  ? call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  down_write+0x2d/0x40
[  +0.000003]  do_unlinkat+0x1a5/0x310
[  +0.000002]  SyS_unlink+0x1f/0x30
[  +0.000004]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000003] RIP: 0033:0x7f7c88eff0e7
[  +0.000001] RSP: 002b:00007ffe736f8038 EFLAGS: 00000217 ORIG_RAX: 0000000000000057
[  +0.000002] RAX: ffffffffffffffda RBX: 00007f7c805fede0 RCX: 00007f7c88eff0e7
[  +0.000001] RDX: 0000000000000000 RSI: 00007f7c45223cc8 RDI: 00007ffe736f8040
[  +0.000001] RBP: 00007ffe736f9100 R08: 000000000000c1de R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000217 R12: 00007f7c85f36460
[  +0.000001] R13: 0000000000000010 R14: 00007f7c804254f0 R15: 00007f7c5a929c58
[  +0.000005] INFO: task apache2:28381 blocked for more than 120 seconds.
[  +0.000761]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000776] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000767] apache2         D    0 28381  32642 0x00000100
[  +0.000003] Call Trace:
[  +0.000003]  __schedule+0x3e0/0x870
[  +0.000002]  ? path_parentat+0x3e/0x80
[  +0.000002]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_write_failed+0x208/0x390
[  +0.000004]  call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  ? call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  down_write+0x2d/0x40
[  +0.000001]  do_unlinkat+0x1a5/0x310
[  +0.000002]  SyS_unlink+0x1f/0x30
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000001] RIP: 0033:0x7f7c88eff0e7
[  +0.000001] RSP: 002b:00007ffe736f8038 EFLAGS: 00000217 ORIG_RAX: 0000000000000057
[  +0.000001] RAX: ffffffffffffffda RBX: 00007f7c805fede0 RCX: 00007f7c88eff0e7
[  +0.000002] RDX: 0000000000000000 RSI: 00007f7c45023cc8 RDI: 00007ffe736f8040
[  +0.000001] RBP: 00007ffe736f9100 R08: 000000000000c1de R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000217 R12: 00007f7c85f36460
[  +0.000001] R13: 0000000000000010 R14: 00007f7c804254f0 R15: 00007f7c5a929c58
[  +0.000002] INFO: task apache2:28383 blocked for more than 120 seconds.
[  +0.000760]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000745] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000774] apache2         D    0 28383  32642 0x00000100
[  +0.000002] Call Trace:
[  +0.000003]  __schedule+0x3e0/0x870
[  +0.000011]  ? spl_kmem_cache_alloc+0x72/0x8c0 [spl]
[  +0.000002]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_read_failed+0x10a/0x170
[  +0.000003]  call_rwsem_down_read_failed+0x18/0x30
[  +0.000001]  ? call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  down_read+0x20/0x40
[  +0.000002]  lookup_slow+0x60/0x170
[  +0.000001]  ? lookup_fast+0xe8/0x300
[  +0.000002]  walk_component+0x1c5/0x360
[  +0.000002]  ? path_init+0x1bd/0x300
[  +0.000002]  path_lookupat+0x73/0x220
[  +0.000002]  ? profile_path_perm.part.7+0x78/0xa0
[  +0.000002]  filename_lookup+0xb8/0x1a0
[  +0.000003]  ? __check_object_size+0xb3/0x190
[  +0.000004]  ? strncpy_from_user+0x4d/0x170
[  +0.000002]  user_path_at_empty+0x36/0x40
[  +0.000002]  ? user_path_at_empty+0x36/0x40
[  +0.000003]  vfs_statx+0x76/0xe0
[  +0.000002]  ? memzero_explicit+0x12/0x20
[  +0.000002]  SYSC_newstat+0x3d/0x70
[  +0.000005]  ? __secure_computing+0x3f/0x100
[  +0.000002]  ? syscall_trace_enter+0xca/0x2e0
[  +0.000003]  SyS_newstat+0xe/0x10
[  +0.000001]  do_syscall_64+0x73/0x130
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000001] RIP: 0033:0x7f7c88efd295
[  +0.000001] RSP: 002b:00007ffe736f7d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  +0.000001] RAX: ffffffffffffffda RBX: 00007ffe736f7de0 RCX: 00007f7c88efd295
[  +0.000001] RDX: 00007ffe736f7d50 RSI: 00007ffe736f7d50 RDI: 00007ffe736f7de0
[  +0.000002] RBP: 0000000000000002 R08: 000000000000c1de R09: 0000000000000005
[  +0.000001] R10: 0000000000000140 R11: 0000000000000246 R12: 00007ffe736f8e00
[  +0.000001] R13: 000000000000000c R14: 00007ffe736f9040 R15: 00007f7c804254d0
[  +0.000007] INFO: task apache2:7623 blocked for more than 120 seconds.
[  +0.000804]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000803] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000809] apache2         D    0  7623  32642 0x00000100
[  +0.000002] Call Trace:
[  +0.000004]  __schedule+0x3e0/0x870
[  +0.000002]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_read_failed+0x10a/0x170
[  +0.000002]  call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  ? call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  down_read+0x20/0x40
[  +0.000002]  lookup_slow+0x60/0x170
[  +0.000001]  ? lookup_fast+0xe8/0x300
[  +0.000002]  walk_component+0x1c5/0x360
[  +0.000002]  ? path_init+0x1bd/0x300
[  +0.000001]  path_lookupat+0x73/0x220
[  +0.000002]  ? profile_path_perm.part.7+0x78/0xa0
[  +0.000003]  filename_lookup+0xb8/0x1a0
[  +0.000003]  ? __check_object_size+0xb3/0x190
[  +0.000002]  ? strncpy_from_user+0x4d/0x170
[  +0.000002]  user_path_at_empty+0x36/0x40
[  +0.000001]  ? user_path_at_empty+0x36/0x40
[  +0.000002]  vfs_statx+0x76/0xe0
[  +0.000002]  ? memzero_explicit+0x12/0x20
[  +0.000002]  SYSC_newstat+0x3d/0x70
[  +0.000002]  ? __secure_computing+0x3f/0x100
[  +0.000002]  ? syscall_trace_enter+0xca/0x2e0
[  +0.000003]  SyS_newstat+0xe/0x10
[  +0.000001]  do_syscall_64+0x73/0x130
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000002] RIP: 0033:0x7f7c88efd295
[  +0.000000] RSP: 002b:00007ffe736f7d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  +0.000002] RAX: ffffffffffffffda RBX: 00007ffe736f7de0 RCX: 00007f7c88efd295
[  +0.000001] RDX: 00007ffe736f7d50 RSI: 00007ffe736f7d50 RDI: 00007ffe736f7de0
[  +0.000001] RBP: 0000000000000002 R08: 000000000000c1de R09: 0000000000000005
[  +0.000001] R10: 00000000000004c8 R11: 0000000000000246 R12: 00007ffe736f8e00
[  +0.000001] R13: 000000000000000c R14: 00007ffe736f9040 R15: 00007f7c804254d0
[  +0.000003] INFO: task apache2:9497 blocked for more than 120 seconds.
[  +0.000825]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000870] apache2         D    0  9497  32642 0x00000100
[  +0.000003] Call Trace:
[  +0.000003]  __schedule+0x3e0/0x870
[  +0.000002]  ? path_parentat+0x3e/0x80
[  +0.000001]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_write_failed+0x208/0x390
[  +0.000002]  ? getname_flags+0x4f/0x1f0
[  +0.000003]  call_rwsem_down_write_failed+0x17/0x30
[  +0.000001]  ? call_rwsem_down_write_failed+0x17/0x30
[  +0.000002]  down_write+0x2d/0x40
[  +0.000002]  do_unlinkat+0x1a5/0x310
[  +0.000002]  SyS_unlink+0x1f/0x30
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000001] RIP: 0033:0x7f7c88eff0e7
[  +0.000001] RSP: 002b:00007ffe736f8038 EFLAGS: 00000217 ORIG_RAX: 0000000000000057
[  +0.000002] RAX: ffffffffffffffda RBX: 00007f7c805fede0 RCX: 00007f7c88eff0e7
[  +0.000001] RDX: 0000000000000000 RSI: 00007f7c59a23cc8 RDI: 00007ffe736f8040
[  +0.000001] RBP: 00007ffe736f9100 R08: 000000000000c1de R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000217 R12: 00007f7c85f36460
[  +0.000002] R13: 0000000000000010 R14: 00007f7c804244f0 R15: 00007f7c5a929c58
[  +0.000006] INFO: task apache2:20033 blocked for more than 120 seconds.
[  +0.000857]       Tainted: P          IO     4.15.18-14-pve #1
[  +0.000866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.000893] apache2         D    0 20033  32642 0x00000100
[  +0.000002] Call Trace:
[  +0.000003]  __schedule+0x3e0/0x870
[  +0.000002]  schedule+0x36/0x80
[  +0.000002]  rwsem_down_read_failed+0x10a/0x170
[  +0.000002]  call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  ? call_rwsem_down_read_failed+0x18/0x30
[  +0.000002]  down_read+0x20/0x40
[  +0.000002]  lookup_slow+0x60/0x170
[  +0.000001]  ? lookup_fast+0xe8/0x300
[  +0.000002]  walk_component+0x1c5/0x360
[  +0.000001]  ? path_init+0x1bd/0x300
[  +0.000002]  path_lookupat+0x73/0x220
[  +0.000002]  ? profile_path_perm.part.7+0x78/0xa0
[  +0.000002]  filename_lookup+0xb8/0x1a0
[  +0.000002]  ? __check_object_size+0xb3/0x190
[  +0.000003]  ? strncpy_from_user+0x4d/0x170
[  +0.000001]  user_path_at_empty+0x36/0x40
[  +0.000002]  ? user_path_at_empty+0x36/0x40
[  +0.000002]  vfs_statx+0x76/0xe0
[  +0.000001]  ? memzero_explicit+0x12/0x20
[  +0.000002]  SYSC_newstat+0x3d/0x70
[  +0.000003]  ? __secure_computing+0x3f/0x100
[  +0.000002]  ? syscall_trace_enter+0xca/0x2e0
[  +0.000002]  SyS_newstat+0xe/0x10
[  +0.000002]  do_syscall_64+0x73/0x130
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  +0.000001] RIP: 0033:0x7f7c88efd295
[  +0.000001] RSP: 002b:00007ffe736f7d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  +0.000001] RAX: ffffffffffffffda RBX: 00007ffe736f7de0 RCX: 00007f7c88efd295
[  +0.000001] RDX: 00007ffe736f7d50 RSI: 00007ffe736f7d50 RDI: 00007ffe736f7de0
[  +0.000001] RBP: 0000000000000002 R08: 000000000000c1de R09: 0000000000000005
[  +0.000001] R10: 0000000000000478 R11: 0000000000000246 R12: 00007ffe736f8e00
[  +0.000001] R13: 000000000000000c R14: 00007ffe736f9040 R15: 00007f7c804254d0
[May30 04:17] EXT4-fs (loop2): mounted filesystem with ordered data mode. Opts: (null)
[May30 09:06] EXT4-fs (loop1): error count since last fsck: 1
[  +0.000016] EXT4-fs (loop1): initial error at time 1559037951: kmmpd:178
[  +0.000004] EXT4-fs (loop1): last error at time 1559037951: kmmpd:178
The PVE stats are blank between 3AM and 9AM (when I increased the disk size). For some more info, the offending container was running nextcloud, and I was generating image previews overnight. So, it's not really surprising that disk space blew up, but the response of Proxmox was a little concerning.
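
A dump like the one above is easier to read once the hung-task warnings are tallied. This is a rough sketch, assuming the `dmesg` output has been saved as text lines; the function name `count_hung_tasks` is made up for illustration, and the pattern only matches the "blocked for more than N seconds" warning format shown in this post.

```python
import re
from collections import Counter

# Matches kernel hung-task warnings such as:
#   INFO: task apache2:28526 blocked for more than 120 seconds.
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def count_hung_tasks(lines):
    """Count hung-task warnings per task name in an iterable of dmesg lines."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[May30 01:26] INFO: task apache2:28526 blocked for more than 120 seconds.",
        "[  +0.000004] Call Trace:",
        "[May30 01:54] INFO: task apache2:17245 blocked for more than 120 seconds.",
    ]
    print(count_hung_tasks(sample))
```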
 

Marcos Mendez

Member
May 19, 2017
São Paulo / Brasil
popsolutions.co
In short, no. Basically, you need to restart every single node to solve the problem, which, I believe, happens when one node sustains very large transfers for so long that corosync communication on that node is disrupted. At this point, only this node should go question-marked. However, a bug in corosync 2.4.2 (fixed in 2.4.3) might be what brought down the whole cluster. I filed a bug report with Proxmox earlier, and the dev said they plan to upgrade corosync to 2.4.4 "soon".

I'm not entirely sure that bug is the cause of the problem. But, nevertheless, it has to be something with corosync. So I guess, we might just want to wait for the 2.4.4 update.

BTW: I now manually limit transfer, e.g. to 95Mbps on bottleneck servers, and now the problem is rare (happened a few times but self-healed quickly).
With 2.4.4 I have the same problem.
 
