We have several Proxmox VE 6.2-6 hosts running LXC containers (CentOS 7) which house an application. Each container has four mounts from the underlying storage. Some of these mounts are shared between containers (e.g. data they only read), while others are individual to each container (e.g. logging directories whose underlying host paths include the container ID). The storage is all local; it is not mounted onto the servers from elsewhere and the hosts are not clustered. The servers are Dell R740XDs and the storage is all SSD.
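To give a feel for the layout, the resulting mount-point entries in a container's config (/etc/pve/lxc/<ID>.conf) look something like the following. The paths here are made-up placeholders to illustrate the shared vs. per-container split, not our real directory names:

# Hypothetical illustration only -- real host paths differ
mp0: /data/app/common,mp=/opt/app/data      # shared between containers (read-only use)
mp1: /data/app/mirror,mp=/opt/app/mirror    # shared between containers
mp2: /data/app/quer/101,mp=/opt/app/quer    # per-container (host path includes container ID)
mp3: /data/app/logs/101,mp=/opt/app/logs    # per-container logging directory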
On a regular basis we need to replace the container image (for the usual sorts of reasons, e.g. patching). At that point the application is stopped across these containers, and the containers themselves are stopped and destroyed (using pct commands). The underlying data disk structures which serve the mounts are not touched; nothing is deleted or removed from there.
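The teardown per container is essentially the following (a simplified sketch, assuming the application inside has already been stopped):

# Simplified sketch of the teardown for one container
pct stop ${ID}
pct destroy ${ID}
# The host directories backing mp0-mp3 are deliberately left untouched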
The containers are then created using a script which does the following:
# Create the container from the template image and grow its root disk
pct create ${ID} ${IMAGE} -cores ${CPU_CORES} -hostname ${HOSTNAME} -memory ${RAM_MB} -net0 ${NET0} -onboot 1 -ostype ${OSTYPE} -ssh-public-keys ${KEY}
pct resize ${ID} rootfs ${DISK_GB}G

# Add the four bind mounts (mp0-mp3), creating the host directories if they don't exist
mkdir -p ${DATA_HOST_MOUNT}
pct set ${ID} -mp0 ${DATA_HOST_MOUNT},mp=${DATA_CONT_MOUNT}
sleep 1
mkdir -p ${MIRROR_HOST_MOUNT}
pct set ${ID} -mp1 ${MIRROR_HOST_MOUNT},mp=${MIRROR_CONT_MOUNT}
sleep 1
mkdir -p ${QUER_HOST_MOUNT}
pct set ${ID} -mp2 ${QUER_HOST_MOUNT},mp=${QUER_CONT_MOUNT}
sleep 1
mkdir -p ${LOGS_HOST_MOUNT}
pct set ${ID} -mp3 ${LOGS_HOST_MOUNT},mp=${LOGS_CONT_MOUNT}
This sequence is run at 1-minute intervals to create each container in turn. The variables are being passed correctly, and I would note that the mkdir is really only needed on the first creation since, as said before, the directories already exist on subsequent creations. The mount points within the containers already exist in the image.
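The outer loop that drives this is roughly the following; this is only a sketch, and create_container here is a hypothetical stand-in for the pct create/resize/set sequence above with the per-container variables set from our own inventory:

# Rough sketch of the outer loop -- variable setup omitted
for ID in $(seq 101 121); do
    create_container ${ID}    # hypothetical wrapper around the sequence above
    sleep 60                  # 1-minute gap before the next container
done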
The problem is that if we bring up a set of 21 containers in this way, ~17 of them will fail to mount one or more of the mount points, frequently mp2, and we don't know why. If we subsequently stop an affected container, wait two minutes, and then start it again, it mounts all of its mount points correctly on the restart.
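In other words, the manual workaround per affected container amounts to:

pct stop ${ID}
sleep 120       # waiting roughly two minutes seems to matter
pct start ${ID}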
Does anyone have any ideas what is going on here, and which logs we can look into to find more information about why this is failing, in the hope of stopping it from happening?