Timeout issue

lweidig

Oct 20, 2011
We have a number of disk images stored on a Nexenta server. After applying the latest updates to a machine and rebooting, we saw that one of the KVM machines did not start and kept generating this error:

Code:
TASK ERROR: command '/usr/bin/ssh -o 'BatchMode=yes' -i /etc/pve/priv/zfs/xx.xx.xx.xx_id_rsa root@xx.xx.xx.xx /usr/sbin/sbdadm list-lu ''' failed: got timeout

If I take the command shown and run it from an SSH session, the answer comes back within a couple of seconds. After some searching, I ended up editing the file /usr/share/perl5/PVE/Storage/LunCmd/Comstar.pm and adding the line $timeout = 15; right after the test for timeout being set. I started the machine again and off it went. I'm wondering whether the timeout values are simply too low and need to be adjusted, or why this is now failing.
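For anyone wanting to compare how long the command really takes against a given limit, here is a minimal sketch. It assumes bash and the GNU coreutils `timeout` utility on the Proxmox host; `timed_run` is a hypothetical helper name, not part of Proxmox, and it does not reproduce how Comstar.pm applies its own timeout.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Proxmox): run a command under a time
# limit and report its exit status and elapsed wall-clock seconds.
# Assumes GNU coreutils `timeout`, which exits with 124 when the limit is hit.
timed_run() {
    local limit="$1"; shift
    local start end status=0
    start=$(date +%s)
    timeout "$limit" "$@" || status=$?
    end=$(date +%s)
    # Report on stderr so the command's own stdout stays clean.
    echo "exit=${status} elapsed=$((end - start))s" >&2
    return "$status"
}

# Example (same command as in the error message, addresses masked as above):
# timed_run 15 /usr/bin/ssh -o 'BatchMode=yes' \
#     -i /etc/pve/priv/zfs/xx.xx.xx.xx_id_rsa root@xx.xx.xx.xx \
#     /usr/sbin/sbdadm list-lu
```

An exit status of 124 means the limit was reached before the command finished; trying a few different limits shows how much headroom the command needs.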
 
Posting my own solution in case it can help another person in the future. It eventually got to the point where the commands were failing both in batch mode and when run directly from the command line, with VERY long (30+ second) response times. They always did respond, just VERY slowly. This continued, and of course each update causes files to get rewritten, which kept breaking my "patch".

So I dug further and spent many hours looking into it. Finally, after some low-level debugging on the Nexenta machine, I was able to see that each ssh session was opening EVERY locale file for some reason, and this was taking a very long time. The solution for me was to delete ALL of the locales in /usr/lib/locale except for the ones we use (this may vary for you, so be careful). In our case we kept: C, en_US.ISO8859-1, en_US.ISO8859-15, en_US.UTF-8 and POSIX. I also backed up the entire folder before deleting anything. After this, logins were back to sub-ms response times!
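The cleanup described above can be sketched roughly as follows. This is only an illustration under assumptions: it assumes bash is available on the Nexenta box, `prune_locales` and the backup path are hypothetical names, and it moves the unwanted locales into a backup directory instead of deleting them outright, so they are easy to restore. Adjust the keep list to whatever your systems actually use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move every entry of a locale directory that is NOT
# on the keep list into a backup directory. Nothing is deleted.
# Usage: prune_locales <locale_dir> <backup_dir> <keep>...
prune_locales() {
    local dir="$1" backup="$2"; shift 2
    local entry name keep wanted
    mkdir -p "$backup" || return 1
    for entry in "$dir"/*; do
        [ -e "$entry" ] || continue          # directory was empty
        name=$(basename "$entry")
        wanted=no
        for keep in "$@"; do
            if [ "$name" = "$keep" ]; then
                wanted=yes
            fi
        done
        if [ "$wanted" = no ]; then
            mv "$entry" "$backup"/ || return 1
        fi
    done
    return 0
}

# Example with the locales kept in this post (backup path is an assumption):
# prune_locales /usr/lib/locale /root/locale-backup \
#     C en_US.ISO8859-1 en_US.ISO8859-15 en_US.UTF-8 POSIX
```

Trying it on a copy of the directory first is a cheap way to confirm the keep list is right before touching the real one.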

Hope this helps somebody else. I spent HOURS tracking this down, working through the standard culprits first: various network, MTU, ssh, ... settings.