Synology UC3200 crashing

bzdigblig

Member
Aug 6, 2021
Just wondering if anybody's got any ideas here. We've got a 3 node Proxmox cluster, with a Synology UC3200 as shared storage. The UC3200 has dual controllers in an active-active SAN configuration, so if one controller craps out, the workload won't be interrupted. The nodes and the UC3200 are all connected via 10 gig (2x10 gig for storage, 2x10 gig for Corosync, 2x10 gig for VM traffic).

Everything's rock solid and stable as long as my VMs are all on one node. If I move a VM to another node, even if it's just sitting there idling, the UC3200 gets really unstable: with two idling VMs running on two different nodes, the UC3200 will crash every day to day and a half. If I'm doing something with even moderate writes, the UC3200 will crash within a couple minutes of that activity. When the UC3200 crashes, both controllers go offline and reboot within 1 second of each other.

I reached out to Synology and they sent us another unit, which had the same issues. They escalated to their dev/engineering dept and sent us a third unit, and it's exhibiting the same issues too, so it's pretty unlikely that the hardware is at fault. That leaves me assuming either that the UC3200 has a software issue, or that Proxmox is somehow doing something to make the SAN lose its mind...

Anybody have any suggestions, theories, or wild speculation?
 
It sounds like you essentially found a way to DoS your SAN.
Given that the only communication between PVE and Synology is iSCSI, a 23 year old industry standard protocol, and the iSCSI initiator (client) in PVE is the standard Linux kernel one, there is absolutely no reason for Synology to behave the way it does. A client should never be able to cause a server (storage in this case) to crash.

I am not too surprised by their lack of desire to troubleshoot the issue. They are in a "Prosumer" segment where "replacement" is the go-to solution strategy. Since you are now on a 3rd unit, I suspect that you will have to troubleshoot and find the culprit yourself. That is, if you are interested in investing the time. There is also the question of "is anyone listening on the other end?".

I wish you best of luck in this. Under different circumstances I would have loved to geek out over the network traces to solve it.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Those are my thoughts exactly. This config ain't rocket science. Even if I was doing something crazy and causing an issue to happen, that should be my problem, not a problem with the storage device. Synology's been quite responsive overall, but getting into a real technical discussion seems unlikely. I send them logs, they send me UC3200s, it seems.

I've looked at the UC3200 logs myself and there's not a whole lot to go on, at least that I'm able to recognize, and I'm not entirely sure where to go from here.
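One low-effort way to get more to go on might be to timestamp the crashes from the outside and line them up with what the nodes were doing. The following is only a rough sketch, not anything Synology or Proxmox provides: it polls the iSCSI portal on each controller (TCP 3260 is the standard iSCSI port) and logs a line whenever a controller's reachability changes. The function names `portal_is_up` and `watch`, and the controller labels and addresses in the usage note, are made up for illustration.

```python
import socket
import time
from datetime import datetime, timezone

ISCSI_PORT = 3260  # standard iSCSI target port


def portal_is_up(host, port=ISCSI_PORT, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all of these as "down".
        return False


def watch(portals, interval=1.0, rounds=None, log=print):
    """Poll each portal and log a timestamped line when its state changes.

    portals: dict mapping a label to a (host, port) tuple.
    rounds:  number of polling passes, or None to run until interrupted.
    Returns the last observed state per label.
    """
    last = {}
    done = 0
    while rounds is None or done < rounds:
        for name, (host, port) in portals.items():
            up = portal_is_up(host, port)
            if last.get(name) != up:
                stamp = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
                log(f"{stamp} {name} {host}:{port} {'UP' if up else 'DOWN'}")
                last[name] = up
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return last
```

Running something like `watch({"ctrl-A": ("192.0.2.11", 3260), "ctrl-B": ("192.0.2.12", 3260)})` from a machine on the storage network (the addresses are placeholders) would leave a UTC-stamped record of each drop, which could then be compared against the Proxmox node logs and any traces captured at the same time. It only shows when the portals stop answering, not why.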
 
I am waiting for my supplier to deliver a UC3200 that will be connected to a cluster of 3 nodes. Were you able to solve the problem?
 
The problem seems to be solved, but I didn't have much to do with it.

The last update that Synology released for the UC3200 broke the non-transparent bridge linking the two controllers together, and this would cause the UC3200 to reboot itself regularly. Synology had to have one of their techs remote into the UC3200 and manually apply a patch to fix the problem.

This fix was just applied recently so I've been stability and performance testing until we're satisfied that we can trust this hardware.
 
