[SOLVED] Strange cluster behavior after upgrade from 7 to 8

Dec 8, 2023
Hello,

We have a three-node cluster, and yesterday we performed an upgrade from 7 to 8.1.4. The upgrade process went smoothly and the cluster is up and running, but it is behaving a bit strangely. Everything is normal for a few minutes, and then one or two random nodes start showing question marks instead of their normal icons. At the same time, the subscription status switches to "No Subscription".

[Screenshot: no_subscription.jpg]


After a minute or two, everything switches back to normal, and this cycle has been repeating over and over since the upgrade.

While the question marks are showing, we can perform almost every operation on the cluster as usual. The only thing we can't do is migrate machines to a node that is showing question marks. Migration from a node with question marks to one with normal icons works fine.

Thanks in advance for any suggestions on how to resolve the issue.

Best regards,
Vedran
 
Hi,
please check your system logs/journal on the nodes with the issues. Is there any storage that might not be responding quickly? Unfortunately, that can currently interfere with gathering the status as a whole: https://bugzilla.proxmox.com/show_bug.cgi?id=3714
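
For a first check, something along these lines should surface storage-related errors (just a quick sketch using the standard journalctl tooling; adjust the filters as needed):

Bash:
# warnings and errors from the PVE status daemon in the current boot
journalctl -b -u pvestatd -p warning
# any iSCSI-related messages in the current boot
journalctl -b | grep -i iscsi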
 
Hi Fiona,
Thank you for your quick response. You are right, there are syslog messages about iSCSI connection errors. Here is an example:

Bash:
2024-04-26T12:36:51.169502+02:00 prox3 iscsid: connection-1:0 cannot make a connection to fe80::9209:d0ff:fe27:4b89:3260 (-1,22)
2024-04-26T12:36:54.146592+02:00 prox3 pvestatd[1239]: command '/usr/bin/iscsiadm --mode node --targetname iqn.2000-01.com.synology:HQ-DataCluster.Target-13.8ae72ef83be --login' failed: exit code 15
2024-04-26T12:36:54.161178+02:00 prox3 pvestatd[1239]: status update time (366.919 seconds)
2024-04-26T12:36:54.169700+02:00 prox3 iscsid: Connection-1:0 to [target: iqn.2000-01.com.synology:HQ-DataCluster.Target-13.8ae72ef83be, portal: fe80::9209:d0ff:fe27:4b89,3260] through [iface: default] is shutdown.

However, the VMs running on those disks keep working without problems, even while the question marks are displayed. Since none of us has much experience with iscsiadm, we would appreciate any quick fix you might know.

We are also considering moving our last few iSCSI disks to NFS. However, when we try to choose a new Target storage, the list doesn't display anything and the connection times out (596). Yesterday we were able to choose a storage without a problem while creating a new VM.
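
If it helps with debugging, we can also run checks from the shell while the GUI list is timing out; a quick sketch using the standard pvesm tool (the storage name is a placeholder, not our actual value):

Bash:
# overview of all configured storages and whether they are currently active
pvesm status
# query a single storage, e.g. the NFS target we want to move to (name is a placeholder)
pvesm status --storage nfs-target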
 
iscsiadm -m node:

Bash:
192.168.5.222:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-1.8ae72ef83be
[fe80::9209:d0ff:fe27:4b89]:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-1.8ae72ef83be
192.168.5.222:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-11.8ae72ef83be
[fe80::9209:d0ff:fe27:4b89]:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-11.8ae72ef83be
192.168.5.222:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-12.8ae72ef83be
[fe80::9209:d0ff:fe27:4b89]:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-12.8ae72ef83be
192.168.5.222:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-13.8ae72ef83be
[fe80::9209:d0ff:fe27:4b89]:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-13.8ae72ef83be

iscsiadm -m session:

Bash:
tcp: [1] 192.168.5.222:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-11.8ae72ef83be (non-flash)
tcp: [2] 192.168.5.222:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-13.8ae72ef83be (non-flash)
tcp: [3] 192.168.5.222:3260,1 iqn.2000-01.com.synology:HQ-DataCluster.Target-1.8ae72ef83be (non-flash)

We get the same error for all three connections from the session command, and the node command shows one target (Target-12) that was deleted from the cluster storage long ago.
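
We assume the stale record could be dropped with something like the following (using the IQN from the node list above), but we have not tried it yet:

Bash:
# delete the leftover node record for the target that was removed from the cluster storage
iscsiadm -m node -T iqn.2000-01.com.synology:HQ-DataCluster.Target-12.8ae72ef83be -o delete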
 
Hello Fiona,

Sorry for the late reply; I had to go on a trip abroad and couldn't try out your suggestion until now.

We did remove the unused iSCSI node from all the servers in the cluster, but the node kept reappearing. Then we deleted the unused iSCSI disk and LUN from the NAS, and all the servers removed the unused node by themselves. After that, we upgraded the servers to PVE 8.2.2 and rebooted each one, but the issue is still present.

We will move all iSCSI disks to NFS and delete all the iSCSI storages. However, we will have to do that at night because the transfer uses all the bandwidth. I will let you know if it resolves the issue.
 
We managed to move all the disks from iSCSI to NFS. The weird thing was that when we tried to choose where to move the disks, the Move disk dialog couldn't load the Target storage list; it would try to load it and eventually fail with a Communication failure message. After a while we found out that if we tried to add a new hard disk, the Storage list would load there, and after that we could also load the list in the Move disk dialog.
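
For anyone hitting the same Communication failure: the move can also be done from the shell; a rough sketch (VM ID, disk slot, and target storage are placeholders, not our actual values):

Bash:
# move a VM disk to another storage and delete the source copy afterwards
qm disk move 100 scsi0 nfs-target --delete 1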

The question mark issue has disappeared since the move to NFS and the cluster is back to full functionality.

However, we have tried to remove the iSCSI nodes using the command from the link you last sent (https://bugzilla.proxmox.com/show_bug.cgi?id=5173#c17), but without success. The session and the node would disappear for a while, only to reappear after ten to fifteen seconds. Then we tried to remove the iSCSI disks from the NAS, since that had made the servers remove the unused node by themselves (see my last post), but nothing happened this time. When we tried to remove the sessions again, the command failed with "Could not logout of" and "32 - target likely not connected" errors. The system log shows that iscsid is constantly (every 3 seconds) trying to connect to the iSCSI target and failing with a Connection refused message. If you have any more suggestions, we would appreciate them.
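
We also wondered whether switching the remaining node records to manual startup would at least stop the constant reconnect attempts; something like this (the IQN is a placeholder, and we have not verified it):

Bash:
# tell open-iscsi not to reconnect to this target automatically (IQN is a placeholder)
iscsiadm -m node -T <IQN> -p 192.168.5.222:3260 -o update -n node.startup -v manual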
 
Finally, at some point, we were able to remove the nodes and sessions on two of the servers with the following commands:

Bash:
iscsiadm -m node -T <IQN> -u
iscsiadm -m node -T <IQN> -o delete

On the third server, the nodes were removed, but not the sessions; those only disappeared after we rebooted the server.

We also had to remove the /etc/iscsi/send_targets/<Target Folder> directory.
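
For reference, the cached discovery records can presumably also be cleaned up via iscsiadm instead of deleting the directory by hand (portal taken from our node list above; not verified on our side):

Bash:
# remove the cached sendtargets discovery entry for the portal
iscsiadm -m discoverydb -t sendtargets -p 192.168.5.222:3260 -o delete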

We find it strange that PVE didn't do all of that by itself when we removed the iSCSI drives from Storage. However, since we have found that Synology has many other problems with iSCSI drives, the issue might be on that side. On another cluster, which has its iSCSI drives on a QNAP NAS, we have never experienced any glitch with them. On the Synology, we have now finally moved all of our disks to NFS.

Another thing I have to mention is that, even though it was a bit scary to see the strange behavior after the update (we have some essential machines on that cluster), the cluster was rock solid the entire time and we could use it for everything as we normally do.
 