PVE Host seems to go "offline" periodically while moving a virtual disk

Gurn_Blanston

Hello,

I first noticed this when using VZDump to back up a VM sitting on a ZFS filesystem to an NFS share on a different host. At the moment I am trying to migrate a 1 TB virtual disk off a zpool that I wish to destroy and rebuild. The NFS target is backed by a ZFS filesystem on a fast SSD-based zpool, and I am using IP over IB. What happens is that after a period of time the host stops posting telemetry to the PVE GUI and the copy job appears to stop completely. After some minutes of freaking out on my part, the host "comes back" and the copy job continues. I took a screenshot to better illustrate this event.


I can see from the lack of blinking drive lights on the hosts that the zpools are not in use; this is backed up by zpool iostat, etc. I find errors in the syslog complaining that it "cannot find volume..." I will paste in a few of these if it helps.

These two entries show, first, the "unable to find" sort of message, and then a copy job being kicked off a fraction of a second later.

Jun 23 13:18:47 pve-1 pvedaemon[8838]: zfs error: cannot open 'pve2zpool2/zsync': dataset does not exist
Jun 23 13:19:08 pve-1 pvedaemon[14058]: <root@pam> move disk VM 101: move --disk ide0 --storage pve1raid10temp

Actually, this copy job was initiated while another, larger copy job was in progress. It may be coincidence, but as soon as the VM 101 job began, my running job for VM 100 seemed to crash, and that is how the first gap in telemetry occurred. The VM 100 job did complete, however; after a while the telemetry came back and the job resumed chugging along, then spontaneously stopped, started again, stopped, and so on.

Another error pops up from time to time in the Syslog relating to the NFS share itself. Here it is:

Jun 23 14:01:22 pve-1 kernel: nfs: server 10.0.1.70 not responding, still trying
Jun 23 14:01:22 pve-1 kernel: nfs: server 10.0.1.70 OK

This error is puzzling in that the server then apparently responds immediately after the error is posted. I might have expected the "not responding, still trying" error to appear while the problem is occurring and then after some time passes the second item saying the server is OK would be posted. Instead it seems that something causes PVE to think NFS is down and it just stops trying, then after a varying amount of time passes, it decides to give it a try again and it works fine. So, what is causing it to stop trying? I am not sure how to go about looking. Any thoughts as to what is causing the server to "go away"? Could it be some part of the clustering system that gets bogged down and requires the host to essentially stop doing whatever it is doing while it catches up?
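For what it's worth, a couple of hedged checks that might narrow this down, run on the PVE host (the NFS client) while a transfer is stalled; these are standard tools rather than anything PVE-specific:

Code:
# run on the PVE host (the NFS client) while the transfer is stalled
nfsstat -rc                 # a climbing "retrans" counter means the client really is retrying
grep nfs /proc/mounts       # shows the mount options actually in effect (timeo, retrans, hard/soft)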

The host servers are both SuperMicro DX10 motherboards with two Xeon E5-2640 v4 (10-core) CPUs, in SuperMicro 4U chassis with 256 GB RAM, SSD-based zpools, and Infiniband for the storage traffic (host-to-host backups, disk migration, etc.). I don't think the bottleneck is the target being busy flushing from write cache to disk, because there is absolutely no disk activity during these pauses.

I recently upgraded the Mellanox stack, which helped transfer speeds somewhat and made VZDump backups possible; before that they would crash and require manually unlocking the host and killing the VZDump job. Sometimes a reboot was needed after a crash. Now at least when it stops copying it doesn't just give up completely and crash; the host just "goes away" from the cluster for a while. In any case, I now have a few production servers in this environment, and not being able to migrate the vdisks without the host going offline intermittently is bad for my stomach.

Sincerely,

GB
 
I have been Googling around and find other PVE users experiencing similar issues with NFS. I get the "task qemu-img blocked for more than 120 seconds" message and then an NFS-related call trace in the syslog. I find other people with these errors online but no explanation or solution. I will say that the job eventually completes, but it only seems to be active for a minute or two at a time, with several minutes of inactivity in between. This is no way to live.

I also misspoke regarding using Infiniband IP for this job. It is 10G Ethernet straight-piped between hosts. I put this in when I was having trouble with Infiniband, so I can't blame this on weak Infiniband support in Ubuntu 16.04. I don't think it is actually network related: my links stay active and up while this error is occurring, and nfsstat doesn't show any errors that I can perceive. These file transfers should FLY! That is why we spent $30K on server hardware and high speed networking. Has anyone gotten to the bottom of this issue? Anyone else suffering with it? Does SMB work better?
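As a side note, those hung-task warnings can be pulled straight out of the kernel log, e.g.:

Code:
dmesg | grep -A 20 "blocked for more than 120 seconds"   # shows the task name plus the call trace that follows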
 
that looks very much like pvestatd blocking - probably because some storage is hanging under the load. you can try triggering this situation and then running "pvesm status" to confirm (if that command hangs / runs into a timeout instead of displaying an overview of your storages, a storage is blocking).
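A minimal sketch of that check, run on the sending node while a move job is stalled (the pvestatd message is the kind of line that shows up later in this thread):

Code:
pvesm status                                  # hangs or times out if a storage is blocking
grep 'status update time' /var/log/syslog     # long pvestatd update times point at the same blockage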
 
Hello Fabian,

I can confirm that pvesm status hangs on the sending server. I am trying to move a 100 GB raw image. It shot up to 5%, stalled for twenty minutes, then shot up to 29% and stalled again; that is where it was when "pvesm status" hung. I noticed when trying this command on the receiving server that it complains about an old ZFS dataset that had been removed but not yet deleted from the GUI's storage configuration, so I removed it. Also, on the sending server, pvesm status complains about not being able to find a ZFS dataset that lives on the receiving server. That dataset is alive and well.

However, the dataset is not shared in any way. When you add ZFS storage, or probably any other sort of storage, the GUI by default selects all nodes. Is that causing this issue? I have wondered what this setting actually does. When I was playing with PVE in the lab I would fastidiously select only the host that the dataset resides on; then I decided that leaving it at the default didn't seem to have any consequences. I thought that maybe there was some behind-the-scenes kung fu that could benefit from letting all nodes have access to the dataset. I knew much less about PVE and ZFS then and have more or less become accustomed to leaving it at the default.

Nowadays, I doubt there is any benefit to letting all nodes "see" the dataset, and maybe this is causing pvestatd to get choked up. I updated the storage configuration to limit each storage to just the nodes that actually hold the dataset. It doesn't seem to have helped; the qm move job is still getting stuck. Maybe I have to reboot the host for this to take effect? I don't know. It does occur to me that if I want to use pve-zsync, which I do, maybe I would need to let all nodes "see" the dataset, since it uses zfs send/receive over SSH. Please confirm or deny this so I know how to set it properly.
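For reference, a minimal sketch of restricting a storage to specific nodes from the CLI (the storage and node names here are just the ones that appear earlier in this thread; the GUI edit does the same thing):

Code:
# limit the storage definition to the node that actually has the pool
pvesm set pve1raid10temp --nodes pve-1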

I have a hard time believing that the issue is slow storage. I don't have any slow zpools; my slowest one is made of three mirrored vdevs and can write sequentially at 250+ MB/s continuously. It seems to write for a little while, then stop for a long while, then write a little again, then stop. There is no obvious pattern to the intervals of activity vs. inactivity, by which I mean it doesn't write for sixty seconds, then stop for 120 seconds, etc. Should I be blaming this on NFS?

We are trying NOT to buy another SAN, simply because we want the virtual machines to get the benefit of the local SAS bus instead of iSCSI etc. Would we be better off using each node as an SMB server and using that topology to move images from host to host and for backups? ZFS has a "sharesmb" dataset property, but I don't know how well integrated it is with PVE. No matter how one tries to get around it, it seems like if you don't have real shared storage you are hobbled. DRBD was fun to play with, but it too seems like a scarily complicated house of cards if you are committed to using ZFS.

We should have so much throughput, but after two months of effort I can't seem to see it. How do I dig deeper now that we know pvestatd isn't seeing what it wants to?

Thanks,

GB
 
I have been looking for NFS tunable options to reduce the amount of time the system waits before trying to reach the NFS server again. I have found conflicting reports on what the default value is: some say 600 deciseconds and some say the default is 7. In any event, I have been unable to set this mount option via the sharenfs property. I believe the option is called "timeo=", but I cannot get PVE to accept my syntax:

zfs set sharenfs="rw@10.0.0.0/24,timeo=x"

This just fails with a syntax error. Also, I can't find any sort of global NFS configuration file. The man page is pretty thorough if you use the /etc/exports file, but that isn't the proper way on ZFS. Somehow, when I run

exportfs -v

I get a list of my shares with their respective export mount options listed, none of which were set by me other than the "rw" option. Where are these options coming from? One of them is unknown to me and I can't find any description of it, and some are listed more than once:

wdelay

What is that? That is all it says, i.e., "no_root_squash, wdelay, relatime", etc. Where do I set these tunable options? I find a zillion hits on Google, some for Solaris, some for RHEL, some for Debian, Ubuntu, etc., but none that seem to apply to PVE. I can't even figure out whether the mount is async or sync. It seems like I should be able to find a combination of settings that works like a champ. There is also a protocol option for RDMA, which should be really fast over Infiniband, one would think. It took me forever to figure out how to pass mount options via the "zfs set" method, but I stopped setting them because leaving them out seemed to work just fine (the options find their way into the export somehow). Is there a cfg file somewhere? I cannot find one.
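For what it's worth, timeo and retrans are client-side mount options, so (assuming the share is consumed through a PVE NFS storage entry) one place they could plausibly go is the storage definition in /etc/pve/storage.cfg rather than the ZFS sharenfs property. The entry below is only a hypothetical sketch, with the export path and option values as placeholders:

Code:
# /etc/pve/storage.cfg -- hypothetical NFS storage entry; adjust server/export/options to taste
nfs: pve1backup
        server 10.0.1.70
        export /pve2zpool2/backup
        path /mnt/pve/pve1backup
        content images,backup
        options vers=3,hard,timeo=600,retrans=3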

Thanks again,

GB
 
check the man pages (zfs, exportfs, exports) for details:

Code:
# zfs set sharenfs=off bigpool/test
# zfs set sharenfs=rw,async,mountpoint bigpool/test
# exportfs -v
/bigpool/test     <world>(rw,async,wdelay,no_root_squash,no_subtree_check,mountpoint,sec=sys,rw,no_root_squash,no_all_squash)
# zfs set sharenfs=ro,async,mountpoint bigpool/test
# exportfs -v                                      
/bigpool/test     <world>(ro,async,wdelay,no_root_squash,no_subtree_check,mountpoint,sec=sys,ro,no_root_squash,no_all_squash)
# zfs set sharenfs=rw,sync,no_wdelay,mountpoint bigpool/test
# exportfs -v                                              
/bigpool/test     <world>(rw,no_root_squash,no_subtree_check,mountpoint,sec=sys,rw,no_root_squash,no_all_squash)
# zfs unshare bigpool/test
# exportfs -v
# zfs share bigpool/test 
# exportfs -v          
/bigpool/test     <world>(rw,no_root_squash,no_subtree_check,mountpoint,sec=sys,rw,no_root_squash,no_all_squash)

man zfs:
Code:
       sharenfs=on | off | opts

           Controls  whether  the  file  system  is  shared  via  NFS,  and  what options are used. A file system with a sharenfs property of off is managed with the
           exportfs(8) command and entries in /etc/exports file. Otherwise, the file system is automatically shared and unshared with the zfs share and  zfs  unshare
           commands.  If  the  property is set to on, the dataset is shared using the exportfs(8) command in the following manner (see exportfs(8) for the meaning of
           the different options):

               /usr/sbin/exportfs -i -o sec=sys,rw,no_subtree_check,no_root_squash,mountpoint *:<mountpoint of dataset>

           Otherwise, the exportfs(8) command is invoked with options equivalent to the contents of this property.

           When the sharenfs property is changed for a dataset, the dataset and any children inheriting the property are re-shared with the new options, only if  the
           property was previously off, or if they were shared before the property was changed. If the new property is off, the file systems are unshared.
 
Ha ha. I avoid the terminal man pages because they strain my eyes, so I tend to look online and sometimes get led astray.

Regarding NFS in general, is this the best way to migrate virtual disks between two hosts without using shared storage?

Has anyone else had problems getting NFS to work efficiently when moving large files? I find some negative comments out on the Internet regarding this problem but also see that there are many possible causes. I am pretty much at a loss as to how to dig deeper.

I do find this, however. It found its way into the syslog moments after I initiated a 2 TB qmmove.

Jun 28 16:39:23 pve-2 pvedaemon[29296]: <root@pam> move disk VM 204: move --disk virtio2 --storage pve1backup
Jun 28 16:39:23 pve-2 pvedaemon[29296]: <root@pam> starting task UPID:pve-2:000072F2:08E0886C:5772EE8B:qmmove:204:root@pam:
Jun 28 16:40:47 pve-2 pveproxy[26977]: worker exit
Jun 28 16:40:47 pve-2 pveproxy[19089]: worker 26977 finished
Jun 28 16:40:47 pve-2 pveproxy[19089]: starting 1 worker(s)
Jun 28 16:40:47 pve-2 pveproxy[19089]: worker 29538 started
Jun 28 16:41:21 pve-2 pveproxy[28719]: worker exit
Jun 28 16:41:21 pve-2 pveproxy[19089]: worker 28719 finished
Jun 28 16:41:21 pve-2 pveproxy[19089]: starting 1 worker(s)
Jun 28 16:41:21 pve-2 pveproxy[19089]: worker 29572 started
Jun 28 16:41:26 pve-2 pmxcfs[9769]: [status] notice: received log
Jun 28 16:42:46 pve-2 kernel: sd 1:0:5:0: attempting task abort! scmd(ffff883fb51f6580)
Jun 28 16:42:46 pve-2 kernel: sd 1:0:5:0: [sdf] tag#4 CDB: Write(10) 2a 00 7c 05 01 8a 00 00 02 00
Jun 28 16:42:46 pve-2 kernel: scsi target1:0:5: handle(0x000e), sas_address(0x4433221105000000), phy(5)
Jun 28 16:42:46 pve-2 kernel: scsi target1:0:5: enclosure_logical_id(0x500605b00ab4bac0), slot(6)
Jun 28 16:42:46 pve-2 kernel: [63B blob data]
Jun 28 16:42:46 pve-2 kernel: sd 1:0:5:0: task abort: SUCCESS scmd(ffff883fb51f6580)
Jun 28 16:42:46 pve-2 kernel: sd 1:0:5:0: [sdf] tag#4 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
Jun 28 16:42:46 pve-2 kernel: sd 1:0:5:0: [sdf] tag#4 CDB: Write(10) 2a 00 7c 05 01 8a 00 00 02 00

Jun 28 16:42:46 pve-2 kernel: blk_update_request: I/O error, dev sdf, sector 2080702858
Jun 28 16:42:46 pve-2 kernel: sd 1:0:5:0: [sdf] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Jun 28 16:42:46 pve-2 kernel: mpt3sas_cm0: sas_address(0x4433221105000000), phy(5)
Jun 28 16:42:46 pve-2 kernel: mpt3sas_cm0: enclosure_logical_id(0x500605b00ab4bac0),slot(6)
Jun 28 16:42:46 pve-2 kernel: [68B blob data]
Jun 28 16:42:46 pve-2 kernel: mpt3sas_cm0: handle(0x000e), ioc_status(success)(0x0000), smid(8)
Jun 28 16:42:46 pve-2 kernel: mpt3sas_cm0: request_len(0), underflow(0), resid(-18944)
Jun 28 16:42:46 pve-2 kernel: mpt3sas_cm0: tag(65535), transfer_count(18944), sc->result(0x00000000)
Jun 28 16:42:46 pve-2 kernel: mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
Jun 28 16:42:46 pve-2 kernel: mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
Jun 28 16:42:46 pve-2 kernel: blk_update_request: I/O error, dev sdf, sector 0
Jun 28 16:42:46 pve-2 zed[29635]: eid=94 class=delay pool=pve2zpool1
Jun 28 16:42:46 pve-2 zed[29638]: eid=95 class=io pool=pve2zpool1
Jun 28 16:42:46 pve-2 kernel: mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
Jun 28 16:42:47 pve-2 kernel: sd 1:0:5:0: [sdf] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Jun 28 16:42:47 pve-2 kernel: mpt3sas_cm0: sas_address(0x4433221105000000), phy(5)
Jun 28 16:42:47 pve-2 kernel: mpt3sas_cm0: enclosure_logical_id(0x500605b00ab4bac0),slot(6)
Jun 28 16:42:47 pve-2 kernel: [68B blob data]
Jun 28 16:42:47 pve-2 kernel: mpt3sas_cm0: handle(0x000e), ioc_status(success)(0x0000), smid(24)
Jun 28 16:42:47 pve-2 kernel: mpt3sas_cm0: request_len(0), underflow(0), resid(-18944)
Jun 28 16:42:47 pve-2 kernel: mpt3sas_cm0: tag(65535), transfer_count(18944), sc->result(0x00000000)
Jun 28 16:42:47 pve-2 kernel: mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
Jun 28 16:42:47 pve-2 kernel: mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
Jun 28 16:42:47 pve-2 kernel: blk_update_request: I/O error, dev sdf, sector 0
Jun 28 16:42:48 pve-2 kernel: sd 1:0:5:0: [sdf] tag#3 CDB: Write(10) 2a 00 45 29 31 0d 00 00 22 00
Jun 28 16:42:48 pve-2 kernel: mpt3sas_cm0: sas_address(0x4433221105000000), phy(5)
Jun 28 16:42:48 pve-2 kernel: mpt3sas_cm0: enclosure_logical_id(0x500605b00ab4bac0),slot(6)
Jun 28 16:42:48 pve-2 kernel: [68B blob data]
Jun 28 16:42:48 pve-2 kernel: mpt3sas_cm0: handle(0x000e), ioc_status(success)(0x0000), smid(109)
Jun 28 16:42:48 pve-2 kernel: mpt3sas_cm0: request_len(17408), underflow(17408), resid(-2048)
Jun 28 16:42:48 pve-2 kernel: mpt3sas_cm0: tag(65535), transfer_count(19456), sc->result(0x00000000)
Jun 28 16:42:48 pve-2 kernel: mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
Jun 28 16:42:48 pve-2 kernel: mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
Jun 28 16:42:57 pve-2 pveproxy[29413]: proxy detected vanished client connection
Jun 28 16:43:35 pve-2 pvedaemon[28958]: worker exit
Jun 28 16:43:35 pve-2 pvedaemon[9842]: worker 28958 finished
Jun 28 16:43:35 pve-2 pvedaemon[9842]: starting 1 worker(s)
Jun 28 16:43:35 pve-2 pvedaemon[9842]: worker 29736 started
Jun 28 16:46:34 pve-2 kernel: nfs: server 10.0.0.60 not responding, still trying
Jun 28 16:46:34 pve-2 kernel: nfs: server 10.0.0.60 OK
Jun 28 16:46:34 pve-2 pvestatd[9828]: status update time (400.123 seconds)

Notice that at 16:42:46 there is an I/O error. This is one of the SSD disks in pve2zpool1. Zpool status shows that all of the drives are online, but there is an error posted:


"One of your devices has experienced an unrecoverable error. An attempt was made to correct the error. Your applications remain unaffected."

This has come up from time to time; there are other I/O errors for other devices in the zpool. The zpool is made up of two 4-drive RAIDZ1 vdevs, and all of the drives are Samsung 850 EVOs. I was pressured into using RAIDZ rather than a pool of mirrors, and while there is little CPU penalty for the host, I wonder what happens if all of the drives cannot write at the same speed.

I think I need to exclude the SSD zpool from my tests and see what happens if I just use the slower spindle pools. Is anyone else running SSD zpools in a production environment? I don't mean for ZIL/L2ARC and such; I mean zpools with no spindles at all. I have read a lot of outdated commentary about sticking to SLC, but my customer thinks 3D NAND is good enough and it is much cheaper. Maybe it isn't up to the task and has nasty side effects? Would sticking with mirrors instead of RAIDZ lessen the write-performance impact? Hmmm.
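A couple of hedged checks for that suspicion; the pool and device names come from the log above, and smartctl needs the smartmontools package if it isn't already installed:

Code:
zpool status -v pve2zpool1     # which vdev/device logged the error and whether ZFS repaired it
smartctl -a /dev/sdf           # the drive's own error counters and wear indicators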


Thanks Everyone,

GB
 
Do you just want to move your vm with zfs disks from one node to another? If so, you can simply migrate it (just make sure both zfs storages have the same storage name) and Proxmox will use zfs send / zfs receive to transfer the disks (including snapshots).
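To make that concrete, here is a hedged sketch; the storage id, pool path, and VMID are placeholders, and with local ZFS-backed disks the VM has to be migrated offline:

Code:
# /etc/pve/storage.cfg -- same storage id on every node that has the pool
zfspool: vmdata
        pool tank/vmdata
        content images,rootdir
        nodes pve-1,pve-2

# offline migration of VM 100 to pve-2; the ZFS-backed disks are sent with zfs send/receive
qm migrate 100 pve-2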
 
You just blew my mind Fabian! Is this documented? What do you mean about "storages have the same storage name"? Do you mean the zfs datasets should have the same name or do both the zpool AND dataset names have to be the same? I have been using a format of

hostnamezpoolx

for the zpools so I can tell them apart in the PVE interface. Are you saying that I should name the zpools all the same thing on each host? For that matter, I would also have to name the datasets the same, right? Is there a man page for this process? The problem with man pages is that you have to know the name of the program to read the man page! I am going to look for a man page under the word migrate but if this isn't it, please set me straight.

This will be a non-trivial task to try out, if so; I will have to evacuate my production zpools and rebuild them with the same name. Ugh. Right now, the only way for me to do this is with NFS, which is proving really difficult to troubleshoot. The NFS target under /mnt/pve/nfsshare keeps "vanishing" from the storage system. I was about to give up on NFS entirely and try SMB. I am able to use SCP all day long, but it is a good deal slower than NFS when NFS is working. I get about 400 MB/s with scp, and the limiting factor seems to be that SCP is single-threaded and pegs out one of the cores in the host. It could be that it is SSH pegging out; both processes are busy. 400 MB/s is probably considered pretty fast for SCP, I would think, but I want faster! I want 1000 MB/s or more. I think ZFS send/receive uses SCP, right? Well, anyway, if this gets me around the overall issue then I may be willing to live with the throughput penalty. I don't really NEED that throughput; it would just make my life easier.

Another question regarding zfs send/receive: which network interface is it going to use? My vmbr0 interface is a measly 1 Gb Ethernet. How do I make it use either my straight-piped 10G Ethernet or my Infiniband switched network? They are on discrete subnets and not bridged/trunked to the production LAN. Isn't it always the case that the solution to one problem is merely the prelude to a whole new set of problems?

I think changing the DNS records for the hosts to reflect the Infiniband interface IPs would cause no small number of undesirable side effects. Hmm. I did read some mention a while back about using Infiniband for the cluster network, so maybe this is doable after all. Some guidance would be most helpful. Some documentation would be even more helpful!

Regarding documentation, I have noticed that there is at least one book published for Proxmox, "The Proxmox Cookbook". Is it worth having? It doesn't seem especially new, so it might not be very up to date. The PVE documentation wiki is a nice broad overview with some step-by-step procedures, but none of them go very deep and many are some years old. I am a paid subscriber, so I am not looking to freeload, but I wish there were somewhere I could get deeper information. I am not criticizing; I think what is there is a great effort for a limited number of developers to have produced, but it seems like the PVE support staff would need something more detailed than the wiki.

All of these paragraphs give me the sense that I am violating proper use of a forum with my scattershot questions and maybe it would be better to take this offline? If so, I don't know how to go about it. I am more accustomed to calling for support via phone and eventually getting someone on the phone who can answer my questions in real time rather than blasting you with every conceivable angle to the issue in the hopes of lessening the amount of back and forth with a whole work day in between responses. Would opening this as a support ticket make for a more efficient resolution? I didn't do that because I had hoped that other people in the community had seen this issue and solved it either by finding a more sensible practice or by tweaking some NFS mount option to work better with ZFS. I am now under a bit of pressure to get this into production ASAP. Fabian, I look forward to your guidance.

Most Sincerely,

GB

PS, if this helps, this is what I want to hear back on:

  • do the zpools have to have the same name as well as the datasets? Do I add the storage to PVE only once for one host rather than for each host?
  • Can I make the migration (zfs send/receive) happen on my faster networks (not on the production LAN)?
  • Is there some way to get access to more thorough documentation? As in, the "Proxmox Cookbook" or man pages that I don't know the names of?
  • Would opening a PVE Subscription Ticket be more efficient for both Proxmox and me in terms of time? Is this the best approach available for working out the kinks? Is there a way to exchange dialogue privately so people don't have to wade through my long winded posts?
 
Hello Again,

I have been looking for answers on the PVE Wiki and find a wiki page called

"Separate Cluster Network"

This had me excited until I read this:

"It is good practice to use a separate network for corosync, which handles the cluster communication in Proxmox VE. It is one of the most important part in an fault tolerant (HA) system and other network traffic may disturb corosync. Storage communication should never be on the same network as corosync!"

So according to this caution, changing the cluster network to my Infiniband subnet so that I can have fast migrations over ZFS send/receive from the web GUI would be a bad idea, right? Is there another way to make the web GUI's ZFS send/receive default to my fast network, my "storage network" in other words? On the other hand, if I have 20+ Gb/s of IP-over-IB throughput, would cluster traffic really be hampered during a large migration job?

Am I trying to pound a square peg into a round hole? It would also be difficult to use SMB on the storage network, I think, since it would have to authenticate to an AD controller, and those are all on the production LAN.

I wish NFS would just work like it did in my lab, but over a faster-than-1-Gbps interface. I was bonding 8 NIC ports between two hosts with no switch and occasionally getting over 400 MB/s host-to-host migration throughput with no issues, but that was a lab. I would never do this with three nodes and a switch; that would be 24 ports just for the hosts. Also, I can't get bonding to work with the vmbr0 bridge anyway. Ugh!

GB
 
  • do the zpools have to have the same name as well as the datasets? Do I add the storage to PVE only once for one host rather than for each host?
  • Can I make the migration (zfs send/receive) happen on my faster networks (not on the production LAN)?
  • Is there some way to get access to more thorough documentation? As in, the "Proxmox Cookbook" or man pages that I don't know the names of?
  • Would opening a PVE Subscription Ticket be more efficient for both Proxmox and me in terms of time? Is this the best approach available for working out the kinks? Is there a way to exchange dialogue privately so people don't have to wade through my long winded posts?

  • yes, the storage definitions are shared over the cluster, and if you have a non-shared storage type configured on each node, you should define it once and use the same names/paths on all the nodes. you can also limit access to a storage to specific nodes, e.g. if you only have a zfs storage with the pool "abc" on two out of three nodes. note that renaming a pool or a dataset is no problem with zfs: for the pool you simply need to "zpool export oldpoolname; zpool import oldpoolname newpoolname", and for datasets you can "zfs rename" - but if those datasets are referenced in e.g. VM/CT configs you will have to edit those manually (see the sketch after this list).
  • the storage migration uses the cluster network, so you can currently only do this by moving both corosync and zfs send/receive to IB. what is usually meant by "separate storage network" is accessing shared storage over a different link, or using a different link for distributed storage like ceph, gluster, .. - because of the huge amount of traffic that would otherwise impact cluster and VM communication. in your case, your IB network is probably faster than the zfs send/receive over ssh, so this should not be such a problem. you can simply test it by migrating a big zfs volume and running omping in parallel (also sketched below) to see whether the cluster/corosync communication is affected.
  • there is an admin guide including all the man pages; if you are on a current version you have a local copy accessible via https://your-node-ip:8006/pve-docs/index.html (it is also linked in parts of the GUI via the "Help" button)
  • generally speaking, I would suggest opening a support ticket for actual issues/problems; for general questions the forum usually provides wider-ranging input. if you have a standard or premium subscription, we also offer remote troubleshooting over SSH if necessary, so in situations where that speeds up problem solving it is also advisable to open a support ticket. I don't think you have to worry about people having to wade through long posts - nobody here is forced to read anything. keeping posts short and to the point will probably lead to more responses though ;)
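Hedged sketches of the rename and of the omping check mentioned above; all pool, dataset, and host names are placeholders:

Code:
# rename a pool so it matches on every node (export it first, so it must not be in use)
zpool export pve1zpool1
zpool import pve1zpool1 tank

# rename a dataset within a pool, then fix any VM/CT configs that still reference the old name
zfs rename tank/olddata tank/vmdata

# run this on every node at the same time while a big migration is in progress;
# packet loss or large latencies would mean corosync traffic is being squeezed
omping -c 60 -i 1 pve-1 pve-2 pve-3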
 
Thank you for the helpful ideas Fabian! Maybe this is the way to go then. It certainly makes things simple for management. Maybe I can speed up SCP by reducing the encryption level or something along these lines. I would love to make NFS work but it seems like an uphill battle. I am not the sort who can forensically analyze a stack trace or whatever it is called in order to find the smoking gun. Maybe in the fullness of time...

GB
 
Here is an update.

I moved the corosync network to the IB network and that seems to work. I also renamed the zpools and standardized my datasets, and at last I have all my VM images moved back to the fast zpools, but when I do a "migrate" it still seems to use the regular production LAN interface. Is there still something I need to do to make the qmmove jobs run over IB? I also added the line

migration_unsecure:1

to /etc/pve/datacenter.cfg

Thus I was expecting super fast transfers, but it is still crappy old gigabit Ethernet. What else do I need to edit?

GB
 
I cannot find any setting per se that will define which network migrations should use. I am now wondering if I need to somehow rename my hosts so that PVE associates the IB network IP address with the host name. Is there a clean way to do this? Will it require rebooting the cluster? I see that corosync.conf has a "node/name" field. See below:

Code:
# cat /etc/pve/corosync.conf 2> /dev/null
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: pve-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve-1-corosync
  }

Here I have added my IB interface, called pve-1-corosync, as the "ring0_addr". This puts corosync on the IB network, because that is the hostname I have set for my IB IP address in the PVE server's hosts file. I was under the impression that moving the corosync network to IB alone was enough to make host-to-host migrations use the IB network; it seems that is not the case for me. Do I also change my node name to pve-1-corosync? Then, if I remove and re-add my zpools, will that force the zfs send/receive over the IB network? I am afraid to just try it without some reassurance. It seems like it could wreck the cluster if it isn't meant to be changed.



Here is the way it is setup now:

10.0.0.0/24 is on Infiniband
192.168.73.0/24 is production LAN and VMBR0 LAN and just a lame 1 GbaseT NIC

For each host I have these entries set in the hosts file (well, really just one hosts file, I know).

pve-1 192.168.73.60/24 pve-1-corosync 10.0.0.60/24
pve-2 192.168.73.70/24 pve-2-corosync 10.0.0.70/24
pve-3 192.168.73.80/24 pve-3-corosync 10.0.0.80/24

The nodes in the PVE web GUI are listed in the format "pve-1"; therefore, I am guessing that when I add ZFS storage and pick which node the zpool is on, it is using that hostname, and therefore picking up that hostname for the zfs send/receive command it assembles when I select my VM and click on "Migrate". If I change the node name in corosync.conf to pve-1-corosync, will my nodes change in the GUI to pve-1-corosync, pve-2-corosync, etc.? Will I have to re-add my ZFS storages at that point? Will it screw up anything in the cluster, such as which interface HTTPS is listening on, and other important services? If I do it this way, I can leave my /etc/hosts files alone so that I can still browse to my PVE hosts' GUI. I am enclosing my "system report" in case this gets confusing; there is a lot of overlapping terminology here, such as host server, hosts file, and host name.

I was unable to find anything on Google about this, probably because I can't think of a specific enough way to phrase my question. I read pretty much the entire manual page by page looking for a specific place to define a default storage network.

GB
 
the storage migration uses the cluster network, so you can currently only do this by moving both corosync and zfs send/receive to IB.

I should have quoted this first. After re-reading this a bunch of times I see that it says "both corosync AND zfs send/receive to IB". I found a wiki article on the former but have absolutely no idea how to do the latter. How do I move "zfs send/receive" to IB? I hope there is a simpler way than what I propose above.

GB
 
I should have quoted this first. After re-reading this a bunch of times I see that it says "both corosync AND zfs send/receive to IB". I found a wiki article on the former but have absolutely no idea how to do the latter. How do I move "zfs send/receive" to IB? I hope there is a simpler way than what I propose above.

GB

for a migration with local disks (for which ZFS uses zfs send/receive), the target IP for the storage migration is determined over the cluster. it's not configurable separately.
 
What does "determined over the cluster" mean? I changed the cluster network to IB, but the migrations are not using it. They can't be, because they are still going at single-gigabit speed. I expected them to be at least as fast as SCP, and after setting "migration_unsecure:1" I expected them to go even faster. My SCP tests were five times the throughput I am getting with "migrate" in the PVE GUI. I spent an entire day getting the cluster moved over to IB and renaming my zpools. Was that for nothing?

Looking at the corosync.conf file, it appears that it is using the IB network, but how do I confirm this? If you are saying that it has to use the cluster network for zfs send/receive, then the cluster must not actually be using that network. I think, however, that it is, but for some reason the migrations are not using it.
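Two hedged ways to check which addresses the cluster stack is actually bound to:

Code:
corosync-cfgtool -s     # prints the local ring 0 address and ring status
pvecm status            # cluster membership as PVE sees it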

You didn't tell me what would happen if I changed the "node name" to "pve-1-corosync" in the corosync.conf. Anything at all? Disaster?

GB
 
I was able to confirm that migrations were using the VMBR0 network because the first line of the output of the qmmigrate job is the destination hostname and its IP. It is getting this from the /etc/hosts file. I confirmed this by editing my /etc/hosts and giving my host name the IB network's IP. I had to reboot the host in order for PVE to accept the new host IP. I also had to delete everything in /root/.ssh/known_hosts and then slogin from one host to the other and vice versa. This will then allow migrations to happen over the IB network. Otherwise changing the cluster network has no effect on the migration process.
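A hedged sketch of the steps described above, using the host names and IB addresses from the earlier post; run the equivalent on each node:

Code:
# /etc/hosts on each node: point the node's own name at its IB address
# 10.0.0.60   pve-1
# 10.0.0.70   pve-2
# 10.0.0.80   pve-3

# after the reboot, drop the stale SSH host keys and accept the new ones in both directions
rm /root/.ssh/known_hosts
ssh pve-2          # from pve-1 (and ssh pve-1 from pve-2)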


Is there another way besides rebooting to make PVE accept a new IP for the Node Name? As you can see, I have not rebooted PVE-2 yet so it still has the Ethernet IP.

Am I causing other issues by doing this? So long as I have the /etc/hosts file the way I want it, DNS doesn't seem to have any negative effect. I have left the DNS A records for the hosts alone and they will still resolve to the original IPs so that I can still manage PVE through my Windows browser. I am also still able to slogin from my Centos VM I use to terminal into the PVE hosts. It seems PVE OS uses the hosts file before trying DNS. Hopefully, it keeps doing that.

I do worry now about what will happen when I do Kernel upgrades. Kernel upgrades will undo any of the OFED IB modules I have installed, which will need to be reinstalled. Hopefully, the IB modules that come with the PVE Kernel will work well enough for Corosync. For that matter, do I even need to keep Corosync on the IB network now?

One thing I noticed about the migration job has me puzzled. I watched the two virtual .raw volumes of my VM copy over incrementally as pictured above but after the two volumes copied, it then started a third set of increments, pictured below:


This is from the bottom but it differs from the first copy processes in that it shows a rate in MB/s. What is that? You can tell me RTFM if you want. Would that be under

man qmmigrate?

Sincerely,

GB
 

I was able to confirm that migrations were using the VMBR0 network because the first line of the output of the qmmigrate job is the destination hostname and its IP. It is getting this from the /etc/hosts file. I confirmed this by editing my /etc/hosts and giving my host name the IB network's IP. I had to reboot the host in order for PVE to accept the new host IP. I also had to delete everything in /root/.ssh/known_hosts and then slogin from one host to the other and vice versa. This will then allow migrations to happen over the IB network. Otherwise changing the cluster network has no effect on the migration process.

sorry, I misread the code - you are right, the migration always uses the IP retrieved via getaddrinfo (this is sometimes cached via the cluster file system, and I mistakenly assumed that it stores the corosync IP there and only uses getaddrinfo as fallback..).

Is there another way besides rebooting to make PVE accept a new IP for the Node Name? As you can see, I have not rebooted PVE-2 yet so it still has the Ethernet IP.

Am I causing other issues by doing this? So long as I have the /etc/hosts file the way I want it, DNS doesn't seem to have any negative effect. I have left the DNS A records for the hosts alone and they will still resolve to the original IPs so that I can still manage PVE through my Windows browser. I am also still able to slogin from my Centos VM I use to terminal into the PVE hosts. It seems PVE OS uses the hosts file before trying DNS. Hopefully, it keeps doing that.

/etc/hosts takes precedence over DNS servers, yes (this is not Proxmox specific ;))
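A quick way to see what name resolution actually returns on a node (getent consults /etc/nsswitch.conf, where "files" normally comes before "dns"):

Code:
getent hosts pve-2     # should print the IB address if /etc/hosts has been updated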

I do worry now about what will happen when I do Kernel upgrades. Kernel upgrades will undo any of the OFED IB modules I have installed, which will need to be reinstalled. Hopefully, the IB modules that come with the PVE Kernel will work well enough for Corosync. For that matter, do I even need to keep Corosync on the IB network now?

see above - no you don't. anyway, after upgrading or installing a new kernel, just reinstall the IB modules before rebooting. the new kernel is not active until a reboot.

One thing I noticed about the migration job has me puzzled. I watched the two virtual .raw volumes of my VM copy over incrementally as pictured above but after the two volumes copied, it then started a third set of increments, pictured below:


This is from the bottom but it differs from the first copy processes in that it shows a rate in MB/s. What is that?

depending on the storage where a local disk is located, different tools (cp, rsync, zfs send/receive, dd) are used to copy/move the disk images. those different tools produce different output ;) in case of a live migration (which only works with shared storage, where no disk image has to be copied/moved, but the memory content of the VM has to be migrated), the output also looks a bit different. migration_unsecure only affects the latter (live migration for Qemu VMs).
 
