Proxmox loses SMB/CIFS connection

Hey guys,

I actually have the same problem, but with a container rather than a VM. For some time now, "bad crc/signature" messages have also kept appearing in the Proxmox syslog. This comes from the container mounting various file server shares and then losing them.
Nobody else mentioned "bad crc" errors in this thread? Sounds like a completely different problem ;)

Container: Ubuntu 20.04.6 on proxmox-ve: 7.3-1 (running kernel: 6.1.2-1-pve)

root@TestContainer:~# dmesg
[1016351.343569] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
[1016351.344417] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.344420] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.344421] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.344422] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.344427] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
[1016351.345072] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.345074] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.345075] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.345075] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.345079] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
[1016351.345519] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.345520] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.345521] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.345522] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.345524] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
[1016351.345826] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.345827] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.345828] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.345829] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.345830] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
[1016351.346117] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.346118] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.346119] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.346120] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.346122] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
[1016351.346393] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.346395] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.346395] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.346396] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.346398] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
[1016351.346679] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.346683] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.346684] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.346686] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.346689] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
[1016351.346964] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.346967] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.346969] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.346971] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.346975] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
[1016351.347318] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.347320] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.347321] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.347322] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.347326] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
[1016351.347677] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.347679] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.347681] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.347683] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.347686] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.347688] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.347690] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.347691] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.347694] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.347695] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.347698] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.347699] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.347702] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.347703] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.347704] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.347705] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.347708] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.347709] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.347710] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.347712] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.347714] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.347715] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.347717] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.347718] 00000030: 00000000 00000000 00000000 00000000 ................
[1016351.347720] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1016351.347721] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1016351.347723] 00000020: 00000000 00000000 00000000 00000000 ................
[1016351.347724] 00000030: 00000000 00000000 00000000 00000000 ................
[1016566.232409] libceph: osd2 (1)10.26.15.50:6888 socket closed (con state OPEN)
[1016751.347292] CIFS: __readahead_batch() returned 3/1024
[1016813.917247] CIFS: __readahead_batch() returned 36/1024
[1016915.274721] CIFS: __readahead_batch() returned 624/1024
[1016919.365776] CIFS: __readahead_batch() returned 320/1024
[1016923.003015] CIFS: __readahead_batch() returned 595/1024
[1016926.848117] CIFS: __readahead_batch() returned 467/1024
[1016930.514902] CIFS: __readahead_batch() returned 798/1024
[1016941.732703] CIFS: __readahead_batch() returned 676/1024
[1017072.537382] CIFS: __readahead_batch() returned 676/1024
[1017119.732143] libceph: osd2 (1)10.26.15.50:6888 socket closed (con state OPEN)
[1017140.806811] CIFS: __readahead_batch() returned 190/1024
[1017147.750476] CIFS: __readahead_batch() returned 160/1024
[1017151.511786] CIFS: __readahead_batch() returned 32/1024
[1017155.065823] CIFS: __readahead_batch() returned 843/1024
[1017162.767514] CIFS: __readahead_batch() returned 676/1024
[1017233.937046] libceph: read_partial_message 00000000f4c76d40 data crc 3384998954 != exp. 3952744336
[1017233.937056] libceph: read_partial_message 00000000c9d36e25 data crc 1459313744 != exp. 2082328904
[1017233.937069] libceph: osd10 (1)10.26.15.52:6804 bad crc/signature
[1017233.937486] libceph: osd9 (1)10.26.15.52:6835 bad crc/signature
[1017234.008573] libceph: read_partial_message 00000000b99ce386 data crc 2146006423 != exp. 94080286
[1017234.008592] libceph: read_partial_message 00000000dc8d021a data crc 1675959117 != exp. 3565934067
[1017234.008942] libceph: osd10 (1)10.26.15.52:6804 bad crc/signature
[1017234.009280] libceph: osd1 (1)10.26.15.50:6810 bad crc/signature
[1017235.084560] libceph: read_partial_message 000000008a4d8d56 data crc 3794373310 != exp. 3796242444
[1017235.084800] libceph: osd1 (1)10.26.15.50:6810 bad crc/signature
[1017235.199896] libceph: read_partial_message 0000000074a0535d data crc 3061441865 != exp. 1489401521
[1017235.199941] libceph: read_partial_message 0000000003a40739 data crc 1365164881 != exp. 238705880
[1017235.200165] libceph: osd3 (1)10.26.15.50:6858 bad crc/signature
[1017235.201002] libceph: osd10 (1)10.26.15.52:6804 bad crc/signature
[1017238.745290] libceph: read_partial_message 0000000070aa279f data crc 1034986088 != exp. 2701370662
[1017238.745559] libceph: osd1 (1)10.26.15.50:6810 bad crc/signature
[1017238.781498] libceph: read_partial_message 00000000ff8fb7dd data crc 2762722003 != exp. 2418349765
[1017238.781505] libceph: read_partial_message 000000002869c2ac data crc 4099414400 != exp. 2039625718
[1017238.781952] libceph: osd2 (1)10.26.15.50:6888 bad crc/signature
[1017238.782261] libceph: osd7 (1)10.26.15.51:6809 bad crc/signature
[1017238.898502] libceph: read_partial_message 00000000ff8fb7dd data crc 2559528101 != exp. 228822481
[1017238.898916] libceph: osd3 (1)10.26.15.50:6858 bad crc/signature
[1017243.300474] libceph: read_partial_message 00000000641bf45d data crc 2796105785 != exp. 472079829
[1017243.300504] libceph: read_partial_message 00000000442bd182 data crc 1721631902 != exp. 4230074893
[1017243.301108] libceph: osd6 (1)10.26.15.51:6839 bad crc/signature
[1017243.301946] libceph: osd0 (1)10.26.15.50:6834 bad crc/signature
[1017244.157516] libceph: read_partial_message 00000000cd2914bc data crc 2139303003 != exp. 971856957
[1017244.157522] libceph: read_partial_message 00000000cb3187d9 data crc 3966178819 != exp. 901871706
[1017244.157827] libceph: osd6 (1)10.26.15.51:6839 bad crc/signature
[1017244.158902] libceph: osd9 (1)10.26.15.52:6835 bad crc/signature
[1017244.447207] libceph: read_partial_message 00000000f3cf65bb data crc 1408972093 != exp. 924660784
[1017244.447219] libceph: read_partial_message 00000000e3f14270 data crc 3335203508 != exp. 1778117778
[1017244.447999] libceph: osd6 (1)10.26.15.51:6839 bad crc/signature
[1017244.449264] libceph: osd7 (1)10.26.15.51:6809 bad crc/signature
[1017286.247718] libceph: osd2 (1)10.26.15.50:6888 socket closed (con state OPEN)
[1017353.848331] CIFS: __readahead_batch() returned 417/1024
[1017400.739167] CIFS: __readahead_batch() returned 807/1024
[1017433.859973] libceph: osd2 (1)10.26.15.50:6888 socket closed (con state OPEN)
[1017435.001857] CIFS: __readahead_batch() returned 673/1024
[1017444.070251] CIFS: __readahead_batch() returned 485/1024
[1017449.026749] CIFS: __readahead_batch() returned 671/1024
[1017480.765560] CIFS: __readahead_batch() returned 27/1024
[1017515.909154] CIFS: __readahead_batch() returned 836/1024
[1017547.569246] CIFS: __readahead_batch() returned 209/1024
[1017550.750279] CIFS: __readahead_batch() returned 268/1024
[1017624.120455] libceph: read_partial_message 0000000067045f34 data crc 1121178867 != exp. 3543028163
[1017624.120777] libceph: osd4 (1)10.26.15.51:6823 bad crc/signature
[1017766.110312] libceph: osd2 (1)10.26.15.50:6888 socket closed (con state OPEN)
[1017922.426403] TCP: request_sock_TCP: Possible SYN flooding on port 8000. Sending cookies. Check SNMP counters.
[1018266.038066] libceph: osd0 (1)10.26.15.50:6834 socket closed (con state OPEN)
[1018306.225506] libceph: osd0 (1)10.26.15.50:6834 socket closed (con state OPEN)
[1018738.060522] cifs_demultiplex_thread: 6 callbacks suppressed
[1018738.060526] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 0
[1018738.061679] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1018738.061681] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1018738.061682] 00000020: 00000000 00000000 00000000 00000000 ................
[1018738.061683] 00000030: 00000000 00000000 00000000 00000000 ................
[1018738.061688] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 0
[1018738.061956] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
[1018738.061957] 00000010: 00000001 00000000 ffffffff ffffffff ................
[1018738.061958] 00000020: 00000000 00000000 00000000 00000000 ................
[1018738.061959] 00000030: 00000000 00000000 00000000 00000000 ................

Mar 16 15:27:34 TestContainer kernel: [1015666.246197] libceph: osd2 (1)10.26.15.50:6888 socket closed (con state OPEN)
Mar 16 15:27:59 TestContainer kernel: [1015691.117903] libceph: read_partial_message 000000004efe0383 data crc 1593899735 != exp. 106228753
Mar 16 15:27:59 TestContainer kernel: [1015691.118711] libceph: osd10 (1)10.26.15.52:6804 bad crc/signature
Mar 16 15:38:59 TestContainer kernel: [1016351.343569] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 2
Mar 16 15:38:59 TestContainer kernel: [1016351.344420] 00000010: 00000001 00000000 ffffffff ffffffff ................
Mar 16 15:38:59 TestContainer kernel: [1016351.344422] 00000030: 00000000 00000000 00000000 00000000 ................
Mar 16 15:38:59 TestContainer kernel: [1016351.345075] 00000020: 00000000 00000000 00000000 00000000 ................
Mar 16 15:38:59 TestContainer kernel: [1016351.345519] 00000000: 424d53fe 00000040 00000000 00000012 .SMB@...........
Mar 16 15:38:59 TestContainer kernel: [1016351.345521] 00000020: 00000000 00000000 00000000 00000000 ................
Mar 16 15:38:59 TestContainer kernel: [1016351.346683] 00000010: 00000001 00000000 ffffffff ffffffff ................
Mar 16 15:38:59 TestContainer kernel: [1016351.346686] 00000030: 00000000 00000000 00000000 00000000 ................
Mar 16 15:42:34 TestContainer kernel: [1016566.232409] libceph: osd2 (1)10.26.15.50:6888 socket closed (con state OPEN)
Mar 16 15:45:39 TestContainer kernel: [1016751.347292] CIFS: __readahead_batch() returned 3/1024
Mar 16 15:48:34 TestContainer kernel: [1016926.848117] CIFS: __readahead_batch() returned 467/1024
Mar 16 15:48:38 TestContainer kernel: [1016930.514902] CIFS: __readahead_batch() returned 798/1024
Mar 16 15:53:41 TestContainer kernel: [1017233.937056] libceph: read_partial_message 00000000c9d36e25 data crc 1459313744 != exp. 2082328904
Mar 16 15:53:41 TestContainer kernel: [1017233.937486] libceph: osd9 (1)10.26.15.52:6835 bad crc/signature
Mar 16 15:53:43 TestContainer kernel: [1017235.084800] libceph: osd1 (1)10.26.15.50:6810 bad crc/signature
Mar 16 15:53:52 TestContainer kernel: [1017244.157522] libceph: read_partial_message 00000000cb3187d9 data crc 3966178819 != exp. 901871706
Mar 16 15:55:41 TestContainer kernel: [1017353.848331] CIFS: __readahead_batch() returned 417/1024
Mar 16 15:57:48 TestContainer kernel: [1017480.765560] CIFS: __readahead_batch() returned 27/1024
Mar 16 15:58:23 TestContainer kernel: [1017515.909154] CIFS: __readahead_batch() returned 836/1024
Mar 16 15:58:58 TestContainer kernel: [1017550.750279] CIFS: __readahead_batch() returned 268/1024
Mar 16 16:18:46 TestContainer kernel: [1018738.060526] CIFS: VFS: \\fileserver No task to wake, unknown frame received! NumMids 0
Mar 16 16:18:46 TestContainer kernel: [1018738.061681] 00000010: 00000001 00000000 ffffffff ffffffff ................
Mar 16 16:23:01 TestContainer CRON[69720]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Did the errors start appearing after a specific update? The crc errors are from Ceph, but they shouldn't be critical if they only affect reads. Is your network connection okay? What does the load look like?
 
Hey Fiona,

I actually don't know whether the "bad crc" problem really is something else entirely. For a long time I didn't know where the problem came from, until I stumbled upon this thread about the SMB connection aborts. The aborts result in various error messages and always bring these "bad crc" entries along with them. That could mean that the kernel not only has problems with the dropped mounts but also causes these messages, which show up globally in the Proxmox syslog.

I can't guarantee it one hundred percent, but I would say the SMB aborts and the "bad crc" problems started with kernel 6.1. The NICs are never more than 40% utilised, their firmware is up to date, and iperf shows full bandwidth to the other host. The network is not causing any problems.
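For reference, the bandwidth check was just a plain iperf run between the hosts, something like this (placeholder hostname, iperf3 syntax shown; the exact invocation may have differed):

Code:
# on the file server (or whichever host is on the other end)
iperf3 -s
# on the PVE host / container, run a 30-second test against it
iperf3 -c fileserver -t 30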
 
Hey Fiona,

I actually don't know whether the "bad crc" problem really is something else entirely. For a long time I didn't know where the problem came from, until I stumbled upon this thread about the SMB connection aborts. The aborts result in various error messages and always bring these "bad crc" entries along with them. That could mean that the kernel not only has problems with the dropped mounts but also causes these messages, which show up globally in the Proxmox syslog.

I can't guarantee it one hundred percent, but I would say the SMB aborts and the "bad crc" problems started with kernel 6.1. The NICs are never more than 40% utilised, their firmware is up to date, and iperf shows full bandwidth to the other host. The network is not causing any problems.
I'd try booting with an older kernel to see if the kernel is actually the culprit. If so, I'd try the 6.2 kernel next; maybe the issue is already resolved there.
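On PVE that can be done by picking the older kernel in the boot menu, or, if your proxmox-boot-tool is recent enough to have the pin subcommand, roughly like this (the version string is only an example):

Code:
# list the installed kernels
proxmox-boot-tool kernel list
# pin an older kernel for the following boots (example version)
proxmox-boot-tool kernel pin 5.15.107-2-pve
# remove the pin again later
proxmox-boot-tool kernel unpin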
 
I'd try booting with an older kernel to see if the kernel is actually the culprit. If so, I'd try the 6.2 kernel next; maybe the issue is already resolved there.
We had insane problems using kernel 6.1.x in our environment, and I am not in the mood to experiment further at the moment. But I tried kernel 5.19, which worked flawlessly, and you were right: kernel 5.19 does not unmount the respective shares; they stay mounted and do not crash. Thank you so much for your help!
 
Did anything resolve this issue? I'm also on kernel 6.1.10-1-pve and I'm getting the same messages:

Code:
Apr 28 00:00:01 proxmox kernel: [39156.521126] CIFS: VFS: \\192.XXX.XXX.XXX No task to wake, unknown frame received! NumMids 4
Apr 28 00:00:01 proxmox kernel: [39156.521127] 00000000: 424d53fe 00000040 00000000 00000012  .SMB@...........
Apr 28 00:00:01 proxmox kernel: [39156.521128] 00000010: 00000001 00000000 ffffffff ffffffff  ................
Apr 28 00:00:01 proxmox kernel: [39156.521128] 00000020: 00000000 00000000 00000000 00000000  ................
 
Hi,
Did anything resolve this issue? I'm also on kernel 6.1.10-1-pve and I'm getting the same messages:

Code:
Apr 28 00:00:01 proxmox kernel: [39156.521126] CIFS: VFS: \\192.XXX.XXX.XXX No task to wake, unknown frame received! NumMids 4
Apr 28 00:00:01 proxmox kernel: [39156.521127] 00000000: 424d53fe 00000040 00000000 00000012  .SMB@...........
Apr 28 00:00:01 proxmox kernel: [39156.521128] 00000010: 00000001 00000000 ffffffff ffffffff  ................
Apr 28 00:00:01 proxmox kernel: [39156.521128] 00000020: 00000000 00000000 00000000 00000000  ................
This is likely a bug with the 6.1 kernels, so either upgrade to 6.2 (the current opt-in kernel) or switch back to the 5.15 kernel. If the issue is still there with 6.2 kernels, we might need to take a closer look.
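For reference, on PVE 7.x the opt-in kernel is a separate meta package, so the upgrade is roughly (package name as it was for the 7.x series):

Code:
apt update
apt install pve-kernel-6.2
reboot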
 
We had insane problems using kernel 6.1.x in our environment, and I am not in the mood to experiment further at the moment. But I tried kernel 5.19, which worked flawlessly, and you were right: kernel 5.19 does not unmount the respective shares; they stay mounted and do not crash. Thank you so much for your help!
Thanks for confirming this; it helped me in one of my environments with non-critical SMB. Most of my storage is NFS or Ceph, thankfully.

Tmanok
 
Thanks for your reply.
I've just installed 6.2.11-1-pve and will report back if the issue persists.
The issue persists, but it occurs way less often. I can only see one SMB entry in the kern.log since yesterday:
Code:
Apr 28 14:31:20 proxmox kernel: [ 3625.889746] 00000020: 00000000 00000000 00000000 00000000  ................
Apr 28 14:57:25 proxmox kernel: [ 5190.812476] 00000000: 424d53fe 00000040 00000000 00000012  .SMB@...........

I've got another issue with kernel 6.2.11-1, but it can be resolved with a workaround:
Code:
Apr 28 23:46:05 proxmox kernel: [36912.064812] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Apr 28 23:46:09 proxmox kernel: [36915.868167] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Apr 29 01:36:24 proxmox kernel: [43530.380177] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0

Workaround > Add to GRUB_CMDLINE_LINUX_DEFAULT as follows:
pcie_aspm=off
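In case it helps anyone, the full workaround is roughly as follows (a sketch; keep whatever options are already in your /etc/default/grub, and hosts booting via proxmox-boot-tool need proxmox-boot-tool refresh instead of update-grub):

Code:
# /etc/default/grub -- append pcie_aspm=off to the existing options
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"

# then regenerate the boot config and reboot
update-grub
reboot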
 
The issue persists, but it occurs way less often. I can only see one SMB entry in the kern.log since yesterday:
Code:
Apr 28 14:31:20 proxmox kernel: [ 3625.889746] 00000020: 00000000 00000000 00000000 00000000  ................
Apr 28 14:57:25 proxmox kernel: [ 5190.812476] 00000000: 424d53fe 00000040 00000000 00000012  .SMB@...........
Please share the full syslog from around the time the error happens, the section for the storage in /etc/pve/storage.cfg and the output of pveversion -v. What happens with the storage after the error (e.g. do VMs get stuck)? Please also describe your CIFS server a bit.
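For reference, roughly the commands to collect that information (the journalctl window is only an example, adjust it to the time of the error):

Code:
pveversion -v
cat /etc/pve/storage.cfg
journalctl --since "2 hours ago" > syslog-excerpt.txt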
 
I use an SMB share for backups, and almost every time I run a backup, Proxmox loses the connection for some reason.
I have already tried upgrading the kernel, but it's still the same issue after ~15% of the backup is done.

When this happens, the NAS appears with status "unknown" inside Proxmox, and I must reboot Proxmox to get it working again. The NAS itself (Unraid) continues to work just fine and can be accessed from any other device.

IN THIS STATE PVE CAN'T BE SHUT DOWN OR REBOOTED!

The backup job, which is stuck because the CIFS connection disappears, blocks everything.

Stopping the backup job does not work either.

Code:
May 16 17:12:01 pve pvedaemon[3064]: starting termproxy UPID:pve:00000BF8:0000B893:64639D41:vncshell::root@pam:
May 16 17:12:01 pve pvedaemon[1032]: <root@pam> starting task UPID:pve:00000BF8:0000B893:64639D41:vncshell::root@pam:
May 16 17:12:02 pve pvedaemon[1033]: <root@pam> successful auth for user 'root@pam'
May 16 17:12:02 pve login[3068]: pam_unix(login:session): session opened for user root(uid=0) by (uid=0)
May 16 17:12:02 pve systemd[1]: Created slice User Slice of UID 0.
May 16 17:12:02 pve systemd[1]: Starting User Runtime Directory /run/user/0...
May 16 17:12:02 pve systemd-logind[711]: New session 1 of user root.
May 16 17:12:02 pve systemd[1]: Finished User Runtime Directory /run/user/0.
May 16 17:12:02 pve systemd[1]: Starting User Manager for UID 0...
May 16 17:12:02 pve systemd[3074]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
May 16 17:12:02 pve systemd[3074]: Queued start job for default target Main User Target.
May 16 17:12:02 pve systemd[3074]: Created slice User Application Slice.
May 16 17:12:02 pve systemd[3074]: Reached target Paths.
May 16 17:12:02 pve systemd[3074]: Reached target Timers.
May 16 17:12:02 pve systemd[3074]: Listening on GnuPG network certificate management daemon.
May 16 17:12:02 pve systemd[3074]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 16 17:12:02 pve systemd[3074]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
May 16 17:12:02 pve systemd[3074]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
May 16 17:12:02 pve systemd[3074]: Listening on GnuPG cryptographic agent and passphrase cache.
May 16 17:12:02 pve systemd[3074]: Reached target Sockets.
May 16 17:12:02 pve systemd[3074]: Reached target Basic System.
May 16 17:12:02 pve systemd[3074]: Reached target Main User Target.
May 16 17:12:02 pve systemd[3074]: Startup finished in 151ms.
May 16 17:12:02 pve systemd[1]: Started User Manager for UID 0.
May 16 17:12:02 pve systemd[1]: Started Session 1 of user root.
May 16 17:12:02 pve login[3089]: ROOT LOGIN  on '/dev/pts/0'
May 16 17:12:30 pve pvestatd[1005]: got timeout
May 16 17:12:32 pve pvestatd[1005]: status update time (45.179 seconds)
May 16 17:12:36 pve pvestatd[1005]: got timeout
May 16 17:12:37 pve pvestatd[1005]: unable to activate storage 'nas' - directory '/mnt/pve/nas' does not exist or is unreachable
May 16 17:12:37 pve pvestatd[1005]: status update time (5.065 seconds)
May 16 17:12:46 pve pvestatd[1005]: got timeout


Code:
root@pve:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 6.2.11-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.6
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-1
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

Code:
cifs: nas
        path /mnt/pve/nas
        server 192.168.1.5
        share proxmox
        content snippets,vztmpl,backup,iso
        prune-backups keep-all=1
        username proxmox
 
Hi,
I use an SMB share for backups, and almost every time I run a backup, Proxmox loses the connection for some reason.
I have already tried upgrading the kernel, but it's still the same issue after ~15% of the backup is done.

When this happens, the NAS appears with status "unknown" inside Proxmox, and I must reboot Proxmox to get it working again. The NAS itself (Unraid) continues to work just fine and can be accessed from any other device.

IN THIS STATE PVE CAN'T BE SHUT DOWN OR REBOOTED!

The backup job, which is stuck because the CIFS connection disappears, blocks everything.

Stopping the backup job does not work either.
As a workaround, running umount -f -l /mnt/pve/nas might help to get out of the stuck state.
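A rough sketch of that workaround (the path and storage name match the storage.cfg posted above; pvesm status is only there to check whether PVE re-activates the storage afterwards):

Code:
# force + lazy unmount of the stuck CIFS mount
umount -f -l /mnt/pve/nas
# check whether the 'nas' storage becomes active again
pvesm status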

What does the load on your host look like during backup (IO, network, CPU)? Is the network stable? You could try limiting the amount of workers used for VM backup and see if that helps: https://forum.proxmox.com/threads/118430/#post-513106
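The worker limit from the linked thread can be set node-wide, roughly like this (a sketch; max-workers only affects QEMU/VM backups, the value is just an example, and it can also be passed per run as vzdump ... --performance max-workers=1):

Code:
# /etc/vzdump.conf
performance: max-workers=1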
 
What does the load on your host look like during backup (IO, network, CPU)? Is the network stable? You could try limiting the amount of workers used for VM backup and see if that helps: https://forum.proxmox.com/threads/118430/#post-513106
The machine is plugged directly into the network switch (it sits right next to it, as does the NAS), so yes, the network connection is stable. :)
The PVE machine uses a 1GbE link, my NAS a 10GbE link.

When I did the backup, I had all (both) VMs stopped, so it is literally just Proxmox VE that is active while the backup runs.
The machine itself is not very powerful, though, because it is only used to host a Home Assistant VM and a log collection VM (Ubuntu Server), which need very little resources.
8 x AMD Ryzen 5 PRO 2500U with 8 GB of RAM

Currently I am doing backups to an external USB drive connected to the PVE machine; that works like a charm.
 
You could try limiting the amount of workers used for VM backup and see if that helps: https://forum.proxmox.com/threads/118430/#post-513106
I tried that; it did not work.
CPU load goes up to 50% and IO delay does not go past 3%, until the CIFS share is lost; then the IO delay is stuck at 12%.

I then tried
Code:
umount -f -l /mnt/pve/nas

From the syslog:
Code:
May 17 09:41:27 pve pvestatd[1047]: unable to activate storage 'nas' - directory '/mnt/pve/nas' does not exist or is unreachable
May 17 09:41:27 pve pvestatd[1047]: status update time (5.092 seconds)
May 17 09:41:30 pve systemd[17740]: mnt-pve-nas.mount: Succeeded.
May 17 09:41:30 pve systemd[1]: mnt-pve-nas.mount: Succeeded.
May 17 09:41:33 pve kernel: CIFS: Attempting to mount \\192.168.1.5\proxmox
May 17 09:43:12 pve pvestatd[1047]: status update time (99.763 seconds)
May 17 09:43:15 pve pvestatd[1047]: got timeout
May 17 09:43:16 pve pvestatd[1047]: unable to activate storage 'nas' - directory '/mnt/pve/nas' does not exist or is unreachable

It did not work:
- NAS connection still gone
- IO delay stuck at 12%
- backup stuck
- cannot abort the backup
- cannot shut down/reboot PVE because PVE can't abort the backup (a hard reset / power cycle is the only way out)
 
I just noticed that even though I did not run any more backups to that NAS as a target, the CIFS mount has gone bad on its own and appears with status "unknown" inside PVE.

Meanwhile, inside the Ubuntu VM running on that machine, a CIFS mount of the same NAS works just fine...
 
Some more info on this:

While Proxmox is in that state where it can no longer reach that SMB/CIFS storage:
- it is also unable to reach any newly added SMB/CIFS storage on a different server in the network
- it IS able to connect to an NFS storage located on either of the NAS systems to which it can no longer connect via CIFS/SMB!?

So what is broken with SMB/CIFS in Proxmox, please?
Why does a reboot of Proxmox fix SMB/CIFS connections for a while until they break again?

It is clearly not the NAS systems.
 
5 days ago I upgraded to 8.0.3, and since then CIFS has been stable. Fingers crossed that 8.0.3 fixed the issue.
 
