Proxmox VE 5.0 released!

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Jul 4, 2017.

Thread Status:
Not open for further replies.
  1. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    #61 BloodyIron, Jul 5, 2017
    Last edited: Jul 5, 2017
  2. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    Oh, and I'm ever so glad that we can import VMs from other hypervisors now! That came faster than I thought :D

    Hopefully we can also get export too at some point, so that I can do all kinds of tasty stuff with that. But that's another day ;)
     
  3. Coneng

    Coneng New Member

    Joined:
    Aug 14, 2015
    Messages:
    13
    Likes Received:
    1
    Thank you.
    There was a typo "strech" should be "stretch"
     
  4. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    THIS is definitely something that NEEDS TO BE DOCUMENTED.

     
  5. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    I'm in the process of upgrading an environment I work with from 4.4 to 5.0, and I just tried migrating a test VM from a 4.4 box to 5.0, as in live migration, and it seems to fail every time. :/

    It looks like this version upgrade might be one of those ones where you will experience downtime, but I'm not 100% sure just yet.

    This kind of downtime though is really frustrating, and I really wish we could stop having this happen :/
     
    #65 BloodyIron, Jul 6, 2017
    Last edited: Jul 6, 2017
  6. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    I just upgraded the cluster. On the last node, I was able to live migrate a bunch of VMs onto it, but now I can't migrate them off it. No matter what node it tries to migrate to, I get:

    Code:
    ERROR: migration aborted (duration 00:00:01): Can't connect to destination address using public key

    EXCEPT it has the public keys for ALL the other nodes in the cluster, and at the CLI I can SSH from that box to other nodes in the cluster JUST FINE. I even restarted sshd on both ends, and rebooted one of the nodes fully. So now I have to shut down every VM to reboot this box, just because of what appears to be a bug.
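    For what it's worth, a quick check that mirrors what the migration task actually does (it runs ssh in batch mode as root, so it fails fast instead of prompting), plus the usual first-aid step after an upgrade, would be something like the sketch below. The target node name pve2 is a placeholder:

```shell
# Reproduce the connection the migration task makes: BatchMode
# disables password prompts, so key problems fail immediately.
ssh -o BatchMode=yes root@pve2 /bin/true && echo "key auth OK"

# Regenerate the cluster's SSH keys/known_hosts entries; often the
# first thing to try for public-key migration errors after an upgrade.
pvecm updatecerts
```

    This is only a sketch against a live cluster, not a guaranteed fix.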

    This... could be worse, but this is leading to more downtime than I was really hoping for >:|

    EDIT: okay this is actually rather a big deal. I can't even migrate the VMs when they're OFFLINE. I have to turn them all off, and reboot the node, and hope this shit works. This is feeling like an issue that could have been hammered out in beta.

    EDIT2: this looks related to a bond0 I have for vmbr0 to use, which is LACP of two NICs. This is the same setup on one of the other nodes, but on this problematic node it somehow didn't transition nicely or something in the upgrade. A bunch of logs, kern.log/syslog/messages are spewing:

    Code:
    vmbr0: received packet on bond0 with own address as source address
    
    It's spewing this non-stop. I am unsure if this is related to my node, or my switch at this point.
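    To put a number on "non-stop", one can count the message in the kernel log. A self-contained sketch (it greps a small three-line sample instead of the live /var/log/kern.log, which is what you would actually grep on a node):

```shell
# Build a sample of the message seen above, then count occurrences.
# On a real node, grep /var/log/kern.log instead of the sample file.
cat > /tmp/kern.sample <<'EOF'
Jul  6 10:00:01 prod1 kernel: vmbr0: received packet on bond0 with own address as source address
Jul  6 10:00:01 prod1 kernel: vmbr0: received packet on bond0 with own address as source address
Jul  6 10:00:02 prod1 kernel: vmbr0: received packet on bond0 with own address as source address
EOF
grep -c 'own address as source address' /tmp/kern.sample   # prints 3
```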

    EDIT3: I had to remove the node from the cluster, and fully reinstall it with 5.0, before I could rejoin it to the cluster and live migrate on and off it. It doesn't have LACP/bonding yet, I wanted to test migration first before doing that. Now to set that back up. I don't know what on earth went wonky in this upgrade, this is bizarre!

    EDIT4: I added the two interfaces to a bond0, with a literally identical configuration to the other node, and I'm getting the "received packet on bond0 with own address" error AGAIN. It's spamming the logs just like before. This is ridiculous! The two nodes are literally identical hardware, yet this one is HATING the bonding. WTF?

    Oddly enough migrating onto and off the node works, despite the logs being spammed. The other node is not getting spammed in the logs. I am going to disable the LACP bonding on this one problematic node.
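    For comparison, an LACP bond under a bridge in /etc/network/interfaces typically looks like the sketch below; the interface names and addresses are placeholders, not taken from the nodes above:

```shell
# /etc/network/interfaces -- sketch of an LACP (802.3ad) bond feeding
# the vmbr0 bridge. Interface names and addresses are placeholders.
auto bond0
iface bond0 inet manual
    slaves eth0 eth1
    bond_miimon 100
    bond_mode 802.3ad

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.10
    netmask 255.255.255.0
    gateway 192.168.1.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0
```

    For this to work, the two switch ports must be configured as an LACP (802.3ad) channel group on the switch side as well.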
     
    #66 BloodyIron, Jul 6, 2017
    Last edited: Jul 6, 2017
  7. joblack

    joblack Member

    Joined:
    Apr 16, 2017
    Messages:
    37
    Likes Received:
    4
    Nice one.

    I am still wondering what the feature 'live migration with local storage' exactly means. If I try it I still get:

    Code:
    2017-07-06 09:12:16 can't migrate local disk 'local-lvm:vm-104-disk-1': can't live migrate attached local disks without with-local-disks option
    2017-07-06 09:12:16 ERROR: Failed to sync data - can't migrate VM - check log
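    The error itself names the switch. As far as I can tell the option just isn't exposed in the GUI, but from the CLI something like the following should work (VM ID 104 is from the log above; the target node name is a placeholder):

```shell
# Live-migrate VM 104 and replicate its local disks to the target
# node. "pve2" is a placeholder target node name.
qm migrate 104 pve2 --online --with-local-disks
```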
     
  8. talos

    talos Member

    Joined:
    Aug 9, 2015
    Messages:
    43
    Likes Received:
    3
    I upgraded my 4 Node Cluster today. Live Migration and updating node by node did the job.

    Thanks for this great Release!
     
  9. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    929
    Likes Received:
    124
    Hello to all,

    In my case, on a 4-node cluster (identical hardware and software), the upgrade was OK on only 3 of the nodes. On one node the upgrade failed because pvedaemon was not able to start, and the systemd logs showed nothing useful, as usual. So I started a fresh install on this node (5.x ISO). Because I do not think ZFS on boot/root is very safe, I have a dedicated SSD with LVM/ext4 for the Proxmox OS only, but I use ZFS on mirrored HDDs in all my cluster nodes (I do not like to put all the eggs in the same basket). I removed this node from the cluster...

    After I reinstalled the node, zpool import -f showed me the old data (VMs), and then I was able to join the node back into my cluster. I was also able to start my VMs on this node.

    The whole recovery process took about 60 minutes. But if my setup had been made with ZFS only, the recovery process would probably have taken many hours in my case.

    Please do not take this to mean that ZFS is bad, or that Proxmox is bad. I am bad, because I fail in some situations :)

    On the performance side, everything is OK as far as I can see in my case (upgrade from the last 4.x to the latest 5.x).

    Have a nice day, and I also want to thank all the people who push this product forward (devs, users, bug reporters and so on).
     
    BloodyIron likes this.
  10. gsupp

    gsupp Member

    Joined:
    Jun 27, 2017
    Messages:
    38
    Likes Received:
    14
  11. Nemesiz

    Nemesiz Active Member

    Joined:
    Jan 16, 2009
    Messages:
    678
    Likes Received:
    42
    Today I tested the Proxmox installation disk. A few ZFS options have become editable. Thanks for the progress.
     
  12. gsupp

    gsupp Member

    Joined:
    Jun 27, 2017
    Messages:
    38
    Likes Received:
    14
    Honestly this sounds like a networking issue more than a Proxmox issue. Have you tried googling with that error message? I'm finding a bunch of posts that sound similar to the issue you're having. Sorry, I know it's frustrating when things were working fine and now they aren't, especially when you have other nodes with the same config that are working. I assume LACP is enabled on the switch ports the server is plugged in to?
     
  13. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    Did you have any LACP going on? Live migration between nodes during upgrade did not work for me at all :( (but might have been due to... other... issues as my notes above outline).

     
  14. talos

    talos Member

    Joined:
    Aug 9, 2015
    Messages:
    43
    Likes Received:
    3
    My upgraded cluster has started making trouble. LXC containers fail to start; some start but then fail with errors about a missing filesystem, missing ELF headers in pam.so, and stuff like that. To me this looks like some sort of storage corruption with LXC?

    Proxmox is running on a shared SAN with LVM.

    I am unable to enter these containers, and migrating them to another host doesn't help. I have the same issue with CentOS 7 and Ubuntu 16.04 containers.

    Code:
    root@prox02:~# pct enter 501
    bash: macr:: command not found
    bash: macr;:: command not found
    bash: male;:: command not found
    bash: malt;:: command not found
    [... many more similar "command not found" lines ...]
    bash: NegativeVeryThinSpace;:: command not found
    bash: /etc/bash.bashrc: line 87: unexpected EOF while looking for matching `''
    bash: /etc/bash.bashrc: line 88: syntax error: unexpected end of file

    Found this in dmesg:

    Code:
    [22154.605297] EXT4-fs error (device dm-24): ext4_find_entry:1473: inode #2332: comm bash: checksumming directory block 0
    [22154.607208] EXT4-fs error (device dm-24): ext4_find_entry:1473: inode #2332: comm bash: checksumming directory block 0
    [... the same error repeats ...]
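    Those checksum errors point at ext4 metadata damage on the container's volume. A hedged first step (container ID 501 from the output above; only with the container stopped) would be:

```shell
# Stop the container, then run a filesystem check on its root volume.
pct stop 501
pct fsck 501
```

    Whether the filesystem is repairable, or the containers need restoring from backup, depends on how far the corruption goes.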
     
    #74 talos, Jul 6, 2017
    Last edited: Jul 6, 2017
    BloodyIron likes this.
  15. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    You're not bad at all! I think you sufficiently explained your situation :) Nice! :D

    Perhaps reconsider ZFS in the future, but I understand your hesitation.

     
  16. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    1. I googled the snot out of the error, and likely found the same results you did. Unfortunately, nothing I found helped at all.
    2. I tried some very drastic stuff, including rebooting the switch, reconfiguring the ports, switching which switch ports are used in the LACP bond, reconfiguring the bond on the node, and a few other things off the top of my head. Still fails. I even recreated the bond after reinstalling the node (full wipe), and it still threw the error. So to me this looks like a bug in a package, because the other node, which is literally identical hardware, isn't throwing the issue on the same switch.
    3. I'm not sure what more I could have tested, so I just ended up undoing the LACP bond to have stability. Hopefully this gets fixed.


     
  17. gsupp

    gsupp Member

    Joined:
    Jun 27, 2017
    Messages:
    38
    Likes Received:
    14
    Looks like you covered everything I could think of as well. Very strange. Does the working node have different package versions than the one that doesn't work? Like were they both upgraded to 5.0 or just the problematic one?
     
  18. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    I was upgrading the whole cluster, which consisted of two nodes that operate 24/7, two nodes that are turned on for labbing purposes, then turned off when not needed (due to their power inefficiencies and loudness). All nodes were previously 4.4 with the latest updates. I upgraded the lab nodes first to 5.0 (release).

    When I upgraded the lab node 1, I was not able to migrate VMs on or off it between it and lab node 2, or other nodes. I can't remember the error but it was extremely vague and I would say useless.

    I then upgraded lab node 2, then "prod" node 2 (which had its VMs moved off it first). At that point I was able to move VMs between lab node 1, lab node 2, and prod node 2. Keep in mind prod node 1 was still on 4.4. Lab nodes 1 and 2 have no LACP going on, but both prod nodes have 2x 1GbE LACP, and have been running that way for a bunch of weeks now.

    prod node 2 threw no issues with LACP after the upgrade; I had to change nothing about its network config.

    And then I upgraded prod node 1.... I saw no errors in the upgrade process, it appeared to come up just fine. I was not aware that the logs were throwing errors at this point. I then migrated one VM onto it, and it seemed good, so I migrated a whole bunch of VMs onto it, and that's when things got... concerning.

    In the second set of migrations I batch migrated a whole bunch of VMs with 3 in parallel at once. It went so fast, I figured I would try in the reverse direction, but with 6 in parallel at once... and that's when I got the error about the public key... on EVERY VM trying to migrate.

    I then kicked to the CLI and was able to actually SSH from prod 1 to prod 2 with zero problems, so the keys were still trusted, the error was bunk. I tried a whole bunch of things, such as checking if root was allowed, verifying if the key files were accessible, etc. It all seemed to be correct, but the error was being thrown. About this time I started looking at the logs, and oh shit, I saw the bond0 error now too.

    That's when the steps I took kind of got hazy in my memory, but my BOFH gene started kicking in too. Anger grew, because none of this made sense.

    I tried removing the node and re-adding it to the cluster without reinstalling it a few times. Removing was successful; re-adding was... not so successful. In the end I wiped it and fed it the same configs. Except even after that, again, bond0 was throwing errors.

    Another thing that's weird, and I'm not sure why... the interface names changed. On prod 2 the interface names stayed the same, but when I fresh-reinstalled prod 1, the interface names changed from eth1 / eth0 to something indicative of the driver or something. I have not yet found where I can change this, as I would prefer eth1 / eth0. But I'm tolerating it for now because my care factor is dropping.

    All in all, I spent way more time on this than I should have. The upgrade for 3 of the 4 nodes went really smoothly, and I seriously cannot fathom why prod 1 has given me so much trouble. Nothing adds up!

    So to ACTUALLY answer your question. So far as I can tell all the nodes have the same package versions.


     
  19. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    BTW I'm loving the little nuanced GUI improvements, like:

    1. Shift click to select multiple VMs during migration. This is really convenient! (should be documented so others know)
    2. Right click on nodes in the left list to issue node-centric commands, like mass migrate.
    3. Colourising of the logs and other things, makes it much easier to visually pick up on things that need my attention.
    With any luck I'll find more things as I go :)
     
  20. BloodyIron

    BloodyIron Member

    Joined:
    Jan 14, 2013
    Messages:
    193
    Likes Received:
    4
    Okay so the node I upgraded from 4.4 to 5.0 has ifconfig

    but the node I rebuilt from scratch 5.0 does not have ifconfig... what.. the hell... :(

    EDIT: fresh installs do not get the package "net-tools", but upgraded ones retain it. This is how I got my precious ifconfig back.
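    For anyone else hitting this: the old tools are one package away, and iproute2 (installed by default) covers the same ground:

```shell
# Restore ifconfig/netstat/route on a fresh install:
apt-get install net-tools

# Or use the iproute2 equivalents that ship by default:
ip addr show     # roughly what ifconfig showed
ip route show    # roughly what route -n showed
```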
     
    #80 BloodyIron, Jul 6, 2017
    Last edited: Jul 6, 2017