Datacenter Whitepaper for Proxmox Management

Chris Rivera
I created this white paper so our staff can troubleshoot Proxmox issues themselves instead of everyone defaulting to me. When they cannot solve an issue, I step in and handle the more important problems.

I'm posting it here in the hope that it helps anyone running into similar issues.




Cloud / VPS Troubleshooting White Paper

Things to ask yourself:




OPENVZ:

Things to note:


  • OPENVZ:
    • vzlist does not always show you all containers on a node.
      • If you want to be sure of which containers are on which node, navigate to:
        • Container config files:
          • /etc/pve/nodes/{nodehostname}/openvz
            (this directory contains all the {vmid}.conf files Proxmox needs to boot the containers; see the listing example after this list)
        • Container client server data:
          • /var/lib/vz/private/{vmid} (the container's private data area; the files live here whether the container is running or not)
          • /var/lib/vz/root/{vmid} (the mount point where a running container's filesystem appears)
  • The node disk space shown in the Proxmox web interface is not a representation of the full HDD. It only shows the space available to the Proxmox operating filesystem… which works out to 10% of a 1 TB HDD, i.e. about 100 GB.
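
Because vzlist can miss containers, the config directory is the authoritative list of what Proxmox will boot on a node. A minimal listing sketch (run on the node itself; it assumes the node's hostname matches its directory name under /etc/pve/nodes):

ls /etc/pve/nodes/$(hostname)/openvz/   # one {vmid}.conf per container registered on this node
ls /var/lib/vz/private/                 # one data directory per container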





VZQuota failed on container boot

Problem Example:
root@proxmox1a:~# vzctl start 973
Starting container ...
vzquota : (warning) Incorrect quota shutdown for id 973, recalculating disk usage
vzquota : (error) quota check : lstat `85EC32562D92': Input/output error
vzquota on failed [1]
root@proxmox1a:~# vzctl stop 973
Unable to stop: container is not running
root@proxmox1a:~# vzctl start 973
Starting container ...
vzquota : (warning) Incorrect quota shutdown for id 973, recalculating disk usage
vzquota : (error) quota check : lstat `85EC32562D92': Input/output error
vzquota on failed [1]

Solution::
· vzquota off {vmid}
· vzquota on {vmid}
· vzctl start {vmid}

Solution Example:
root@proxmox1a:~# vzquota off 973
vzquota : (error) Quota is not running for id 973
vzquota : (warning) Repairing quota: it was incorrectly marked as running for id 973
root@proxmox1a:~# vzquota on 973
root@proxmox1a:~# vzctl start 973
Starting container ...
vzquota : (warning) Quota is running for id 973 already
Container is mounted
Adding IP address(es): 199.195.214.235
Setting CPU units: 1000
Setting CPUs: 2
Container start in progress...
root@proxmox1a:~# vzctl enter 973
entered into CT 973





Insufficient Disk Space

On the Host Node:

Sometimes the host node becomes full on one of its LVM partitions (/var/lib/vz). This can cause a VM creation to fail, or stop a starting VM from connecting to the internet even though it is up and running and ifconfig shows an IP address.

Solution::


  • df -h (look for a full filesystem)
  • Navigate to that directory to clear out any old files
  • Clear old logs:
    • cd /var/log
    • rm *.1
    • rm *.gz
  • Then go back to whatever you were originally trying to accomplish (a sketch for finding the biggest space consumers follows this list)
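
If clearing old logs is not enough, du will show where the space actually went. A minimal sketch, assuming the default Proxmox/OpenVZ paths:

du -sh /var/lib/vz/* 2>/dev/null | sort -h   # largest entries sort to the bottom
du -sh /var/log/* 2>/dev/null | sort -h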

Solution2::

Resize the full LVM partition while the server is running, with no downtime. Contact management to get this done.
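
For reference only (this is the kind of change management performs): an online grow of the data volume usually looks like the sketch below. The pve/data names are the Proxmox defaults and the 50G figure is an example, so treat both as assumptions.

lvextend -L +50G /dev/pve/data   # grow the logical volume (needs free extents in the volume group)
resize2fs /dev/pve/data          # grow an ext3/ext4 filesystem to match, while still mounted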








Device or Resource Busy (last updated 3-27-2012), no need to reboot

This means there is an issue with the node you are on trying to write to the /etc/pve mount. This mount is controlled by the pve-cluster service. If you stop the pve-cluster service, this will remove the mount from the node's /etc/pve. When you start the pve-cluster service, it will remount the /etc/pve data, which contains all the information the cluster needs to work.


Solution::

service pve-cluster stop
service pve-cluster start
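
To confirm the remount worked, a quick sanity check (my addition, not part of the original procedure):

mount | grep /etc/pve   # the fuse mount should be listed again
ls /etc/pve/nodes       # should list every node in the cluster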


If the solution above fails:

log into all the nodes and run:

service cman stop ( stops all cluster services / communication )
service pve-cluster stop ( removes the busy drive / resource )
service pve-cluster start ( remounts /etc/pve/ so it is writable )
service cman start ( starts all cluster services again )


I use PuTTYCS to control multiple instances of PuTTY so I can run commands across the whole cloud in one interface.

Run each command on all nodes in the cluster; once you have finished all nodes, run the next command (an SSH loop sketch follows the example).

Example:

service cman stop (node1)
service cman stop (node2)
service cman stop (node3)

service pve-cluster stop (node1)
service pve-cluster stop (node2)
service pve-cluster stop (node3)

service pve-cluster start (node1)
service pve-cluster start (node2)
service pve-cluster start (node3)

service cman start (node1)
service cman start (node2)
service cman start (node3)
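
If you do not have PuTTYCS handy, a plain SSH loop gives the same ordering: each command finishes on every node before the next command starts. A minimal sketch, assuming key-based root SSH and the example hostnames node1-node3:

for cmd in 'service cman stop' 'service pve-cluster stop' 'service pve-cluster start' 'service cman start'; do
    for node in node1 node2 node3; do
        ssh root@$node "$cmd"   # run the current command on every node before moving on
    done
done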


By this time you should be online and good to go!



Service cman stop / restart fails

Cman will only present problems when:

  • You forcefully kill the corosync process and try to start it again without rebooting the node
  • There is any slight-to-major network disturbance ( DDoS, broadcast or multicast storm )
  • The service crashed

1) If you cannot stop or restart the process ( service cman stop / service cman restart ), check the Miami network in Cacti for incoming/outgoing bandwidth. (cman is very sensitive; I've seen as little as 300-400 Mbit/s directed at one node in the cluster affect it.)
2) Check the Miami switch for any multicast storm (show storm-control multicast).
a) We had an issue where the switch detected a multicast storm and began filtering/blocking multicast communication. Multicast is how the cluster nodes communicate with each other.

Example of a multicast storm:
A dedicated client caused cman to stop working by sending a multicast attack. This can only be tracked in the switch, since multicast does not use much bandwidth and cannot be seen in Cacti.

cisco3750-1.mia.fortatrust.com#show storm-control multicast
Interface  Filter State  Upper    Lower    Current
---------  ------------  -------  -------  -------
Gi1/0/1    Forwarding    56.00%   56.00%    0.00%
Gi1/0/2    Forwarding    56.00%   56.00%    0.00%
Gi1/0/3    Forwarding    56.00%   56.00%   52.92%
Gi1/0/4    Link Down     56.00%   56.00%    0.00%
Gi1/0/5    Link Down     56.00%   56.00%    0.00%
Gi1/0/6    Forwarding    56.00%   56.00%    5.31%
Gi1/0/7    Link Down     56.00%   56.00%    0.00%
Gi1/0/8    Link Down     56.00%   56.00%    0.00%

This client was sending so much multicast that all nodes showed offline… while they were still accessible via SSH and not being slammed with a DDoS. Once we saw this client utilizing so much, we disconnected them and the cluster instantly came back online with no need to restart any services.






Service pve-cluster start/restart fails

Restarting pve cluster filesystem: pve-cluster
fuse: failed to access mountpoint /etc/pve: Transport endpoint is not connected
[main] crit: fuse_mount error: Transport endpoint is not connected
[main] notice: exit proxmox configuration filesystem (-1)
(warning).

This happens because someone may have been inside the /etc/pve/*/*/* directories when the pve-cluster stop command was run. This causes the system to not properly unmount the directory.


Solution::

cd / # navigate out of the mount area
umount -f /etc/pve # force to unmount the directory
service pve-cluster start
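
Before forcing the unmount, you can also check who is still sitting inside the mount. A small check I add (fuser ships in the psmisc package):

fuser -vm /etc/pve   # lists the PIDs and users still holding /etc/pve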







Cannot Terminate:: Container is currently mounted (umount first)

When you turn off a VM, Proxmox unmounts it from its active location to the spot where all other powered-off VMs are stored. This does not always work, and you may need to step in manually.

Solution1::

Manually click the Umount button at the top right of the screen in the Proxmox interface, where you find the other buttons such as { migrate, start, stop, remove }.

Solution2::

If solution 1 did not work… this means there is an issue with the Proxmox folders for active / non-active VMs. Recreate them and remove the container by hand; a verification sketch follows the list.

** Before removing anything from the cluster via CLI or the Proxmox interface, make sure the WHMCS auto-provision action ( terminate ) has been run. All automation rules need to be run before any manual actions are taken.


  • mkdir /var/lib/vz/root/{vmid}
  • mkdir /var/lib/vz/private/{vmid}
  • vzctl umount {vmid}
  • vzctl destroy {vmid}
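
To confirm the cleanup actually worked, a quick check (my addition; vzlist -a lists stopped containers too):

vzlist -a | grep {vmid}         # should return nothing once the container is destroyed
ls /var/lib/vz/private/{vmid}   # should fail with 'No such file or directory'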





Nodes Out of Sync

This system is ultra-sensitive to issues within the network ( large / small DDoS, broadcast / multicast storm ). DDoS is the only one that is easily recognized by the Cacti system. If there is a lot of bandwidth hitting one of the nodes on the cluster, the cluster will react by losing sync.
** All VPSes and nodes should still be online, accessible, and providing service.

Solution::

1. Open up Cacti and make sure the Miami network is clear of any inbound / outbound DDoS
2. Log onto the node via SSH
   a. Type clustat
   b. Make a list of the servers that show offline
3. Log onto the nodes that are red via SSH
   a. Type service pve-cluster restart (wait 1-2 minutes)

If the node or nodes still show offline... there may be an issue within the network. Let it be... try to come back and troubleshoot again later.

If the problem will not fix itself and you have come back a few times to get it online... try rebooting the node. To successfully reboot a node and not have issues when it comes back online, do this:

service vz stop # stops all running containers
shutdown -rf now # reboots the box with fsck skipped

When the node comes back online it should be fixed. If not, this needs to be brought to management for IMMEDIATE review.





VM cannot connect to the Internet

When this happens, stopping the container and then starting it will normally fix the issue. ** DO NOT RESTART… STOP then START ** This removes the network route on shutdown, then re-adds it on startup, clearing network-related issues. If the container still does not have access to the internet, try to ping the gateway ( the host node's Proxmox IP ). Work through the numbered steps below; a route check sketch follows them.

Example: VM 222 on node 2

Solution::

1. ssh to node 2
2. vzctl enter 222
3. ping 63.217.249.159 (node2's Proxmox IP); this should always work
a. If you cannot ping the host node:
· service networking restart (on the host node)
· vzctl stop 222
· vzctl start 222
· vzctl enter 222
· ping anything outside
4. ping 8.8.8.8 (this should work if the server can get outside of the network)
a. If this works, you can get outside of the network and there might be an issue with /etc/resolv.conf
b. If this doesn't work, try stopping ( vzctl stop 222 ) then starting ( vzctl start 222 ) the VM
5. exit
6. exit
7. This will leave you back on the host node CLI
8. vzctl stop 222
9. vzctl start 222
10. Check for any errors related to arp… eth0… vmbr0
a. If there was an arp error, that means there is an IP conflict. ARP in Miami is locked so conflicts will not happen… the first MAC address to claim the IP address keeps it, meaning you will need to re-IP this VM
i. I've found this to be a small issue with clients that have multiple IPs. Some IPs work, but they may have one or two that have been reassigned to another client. This happens because we don't have IPs to provision new servers and may have assigned an IP of a server that was suspended yesterday but paid and is active online today
b. If there is no issue, check to see if you can ping google.com
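
One extra check worth knowing (my addition; it assumes the container uses the default OpenVZ venet networking): the route that the stop/start cycle rebuilds is visible on the host node, so you can confirm it is actually back.

ip route | grep {container-ip}       # on the host node; a running venet CT shows '... dev venet0'
vzctl exec 222 ip addr show venet0   # the container's view of its venet addresses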

If this still doesn't solve your connectivity issue, bring it up to management for review. That way someone can dedicate time to dig further into the issue.

You're welcome... I will keep updating as I come across and solve issues. If this is useful for others, I hope a Proxmox dev can sticky this.

udo, thanks for pointing this out. I will go through the file and make sure the commands are all lowercase and in the proper format.

Word automatically capitalized some of the commands that were on a new line.

** A wiki page would be good, but unless I have rights to update it, it will not be updated unless someone makes it a point to update the content when solutions are found.
