Volume Mounting Troubleshooting
This article describes how to resolve pods that are stuck in the ContainerCreating status when the command kubectl describe pod <pod-name> indicates volume mounting failures in the events.
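To confirm the symptom, list the pods and inspect the stuck pod's events. A minimal sketch (the namespace and pod name are placeholders):
kubectl get pods -n <namespace>
# pods stuck in ContainerCreating are the candidates
kubectl describe pod <pod-name> -n <namespace>
# look for FailedMount / FailedAttachVolume warnings in the Events section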
Mount failed for “PVC already exists”
The error message is similar to:
Warning FailedMount 2m9s (x238 over 7h52m) kubelet MountVolume.MountDevice failed for volume "pvc-2f77910e-068f-488c-9e5e-929f88564a76" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-000c-rook-central-0000000000000001-580865e7-8ef0-4eca-aecb-829f9712f4ee already exists
This type of failure is caused by networking issues. Here are some workarounds to fix the issue:
- Restart the kube-proxy pod, and the csi-cephfsplugin pod or csi-rbdplugin pod on the node.
Examine the log of the kube-proxy pod and check whether there are repeating errors connecting to the API server at https://67.58.53.147:6443, e.g.:
E0403 10:47:08.012312 1 reflector.go:147] k8s.io/client-go@<version>/tools/cache/reflector.go:229: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://67.58.53.147:6443/api/v1/nodes?fieldSelector=metadata.name%3Dhcc-nrp-shor-c5934.unl.edu&resourceVersion=9989991954": dial tcp 67.58.53.147:6443: connect: no route to host
Delete the kube-proxy pod and wait for it to restart.
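As a sketch, assuming kube-proxy runs in the kube-system namespace with the usual k8s-app=kube-proxy label (adjust to your cluster):
# find the kube-proxy pod on the affected node
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide | grep <node-name>
# confirm the "no route to host" errors, then delete the pod
kubectl -n kube-system logs <kube-proxy-pod> | tail
kubectl -n kube-system delete pod <kube-proxy-pod>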
Then delete the csi-*plugin pod, depending on the StorageClass of the volume. Get the volume's StorageClass by command kubectl describe pv, e.g.:
kubectl describe pv/pvc-f67277a5-dd6e-4150-9937-aac1b88b8bf9 | grep StorageClass
StorageClass: rook-ceph-block-east
In the above example, it's a ceph block storage, so delete the csi-rbdplugin pod. If it's a cephfs storage, delete the csi-cephfsplugin pod. Monitor the pods until they start.
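A sketch of this step, assuming the CSI plugin pods run in the rook namespace (some clusters use rook-ceph; adjust accordingly):
# find the plugin pod on the affected node; use csi-cephfsplugin for cephfs volumes
kubectl -n rook get pods -o wide | grep csi-rbdplugin | grep <node-name>
kubectl -n rook delete pod <csi-rbdplugin-pod>
# watch the replacement pod start
kubectl -n rook get pods -o wide | grep <node-name>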
- Delete the volumeattachment and restart the csi-cephfsplugin or csi-rbdplugin pod.
Run command kubectl get volumeattachment | grep <PVC-name> to get the volumeattachment of the volume, and then run kubectl delete volumeattachment csi-xxx to delete the volumeattachment. After this, delete the csi-cephfsplugin or csi-rbdplugin pod depending on the StorageClass and wait for the volume to mount.
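For example, the whole sequence for the volume from the first error message above (the attachment name comes from the first command's output):
kubectl get volumeattachment | grep pvc-2f77910e-068f-488c-9e5e-929f88564a76
# note the NAME and NODE columns, then delete the attachment
kubectl delete volumeattachment <csi-attachment-name>
# finally restart the matching csi-*plugin pod on that node, as above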
Multi-Attach error
Multi-attach errors only apply to block volumes, since a block volume can be exclusively attached to only one node at a time. The error message looks like:
[Warning] Multi-Attach error for volume "pvc-24d9f8f7-82ac-411c-9f2b-25ee93e7259e" Volume is already exclusively attached to one node and can't be attached to another.
First, examine whether it is really attached to a running pod. If there is no running pod that mounts this volume, find the node where the volume is attached, and apply the same steps as in the "PVC already exists" case. The command to find the attachment is kubectl get volumeattachment | grep <PVC-name>, and the node should be listed there.
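For example, using the PVC from the error message above:
kubectl get volumeattachment | grep pvc-24d9f8f7-82ac-411c-9f2b-25ee93e7259e
# the NODE column of the output shows where the volume is currently attached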
There is a chance that the pod still cannot mount the volume with the same multi-attach error, but kubectl get volumeattachment | grep <PVC-name> no longer finds the attachment. In this case, reboot the node that previously had the attachment (see "How to reboot a node" below).
"Permission denied" error
The error message is similar to:
MountVolume.SetUp failed for volume "pvc-5b0a3f01-7914-45c3-913b-28a49fe3336f" : rpc error: code = Internal desc = stat /var/lib/kubelet/plugins/kubernetes.io/csi/rook-system.cephfs.csi.ceph.com/e34a4060e2564977fbf8af9a8e7fde65500d9c79898a1739854084c48ade1c6f/globalmount: permission denied
If this happens, reboot the node.
Errors in ceph clusters
If the above steps didn't fix the volume issues, check the status of the ceph cluster. For example, get into the shell of the ceph-tools pod of the corresponding ceph cluster, and run the ceph -s command to check the status. If there are errors regarding mds or mgr services, restart the corresponding pods.
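A sketch of that check, assuming the tools pod comes from a rook-ceph-tools deployment in the rook namespace (names vary per cluster):
# open a shell in the ceph tools pod and check the cluster health
kubectl -n rook exec -it deploy/rook-ceph-tools -- ceph -s
# if the mds or mgr lines report errors, restart the affected pods, e.g.:
kubectl -n rook delete pod -l app=rook-ceph-mds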
StorageClass and accessModes mismatch
User configuration error is another reason volumes fail to mount. The StorageClass cephfs allows multiple pods to attach the same volume, so the accessModes should be ReadWriteMany. The StorageClass ceph-block is block storage and can only be attached to a single pod, so the accessModes should be ReadWriteOnce. If the StorageClass and access mode are mismatched, the volume cannot mount.
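For reference, a matching pair would look like this sketch (the storageClassName values are illustrative; use the classes available in your cluster):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim
spec:
  storageClassName: rook-cephfs   # a cephfs class: shareable, pair with ReadWriteMany
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi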
Here’s an example for how to check the accessModes:
kubectl get -n mizzou pvc/claim-hikf3-40mail-2emissouri-2eedu -o yaml | grep accessModes: -A 1
Here the PVC name is claim-hikf3-40mail-2emissouri-2eedu, the namespace is mizzou, and I added -A 1 to the end to display the line below accessModes:
accessModes:
- ReadWriteOnce
Advise the user to update the configuration.
How to reboot a node:
Admins can reboot a node using the Ansible playbook: ansible-playbook reboot.yaml -l {node name}
Steps to reboot a node manually:
- Drain the node: kubectl drain {node name} --ignore-daemonsets --delete-emptydir-data --force
- SSH into the node and reboot it: ssh {user}@{node name} reboot
- If it is a GPU node, check that the driver is back up: nvidia-smi
- Uncordon the node: kubectl uncordon {node name}
