Node Management
Rebooting a node
Key Considerations
- Check if the node has any rook-ceph-osd-* pods. Verify the health of the corresponding Ceph cluster and bring down only one node at a time.
- Check for haproxy-ingress-* pods. If the node will be down for an extended period, disable its record in Constellix DNS.
- Check if the node has the nautilus.io/linstor-server label. Such a node serves as a Linstor server. Some Linstor servers are redundant, while others are critical.
- Check if the node has the nautilus.io/bgp-speaker label. Two nodes are used for MetalLB IPs; ensure one remains active.
- Check if the node has the node-role.kubernetes.io/master label. Rebooting this node will make the cluster inaccessible, unless it is an Admiralty virtual node.
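The checks above can be gathered into one pass before a reboot. The following is a minimal sketch, assuming kubectl is already configured for the cluster; the function name node_reboot_warnings and the wording of its messages are hypothetical and not part of any Nautilus tooling.

```shell
#!/usr/bin/env bash
# Sketch only: node_reboot_warnings is a hypothetical helper name.
# Prints one line for each consideration that applies to the given node.
node_reboot_warnings() {
  local node="$1" pods labels
  # Names of all pods scheduled on the node, across every namespace.
  pods=$(kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=${node}" -o name)
  # Node summary line; the labels are in the last column.
  labels=$(kubectl get node "${node}" --show-labels --no-headers)

  case "${pods}" in
    *rook-ceph-osd-*) echo "ceph-osd pods: verify Ceph health, one node at a time" ;;
  esac
  case "${pods}" in
    *haproxy-ingress-*) echo "haproxy-ingress pods: consider disabling the DNS record" ;;
  esac
  case "${labels}" in
    *nautilus.io/linstor-server*) echo "linstor-server label: may be a critical Linstor server" ;;
  esac
  case "${labels}" in
    *nautilus.io/bgp-speaker*) echo "bgp-speaker label: keep the other speaker active" ;;
  esac
  case "${labels}" in
    *node-role.kubernetes.io/master*) echo "master label: reboot may make the cluster inaccessible" ;;
  esac
}
```

Calling node_reboot_warnings <nodename> prints nothing when none of the considerations apply. The output is only a reminder; the follow-up actions listed above still have to be carried out by hand.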
Prerequisites
- Install Ansible on your local computer.
- Clone the repository of Ansible playbooks:
    git clone https://gitlab.nrp-nautilus.io/prp/nautilus-ansible.git
- Pull the latest updates from the playbook repository:
    cd nautilus-ansible; git pull
Reboot a Node Due to GPU Failure
Use the following command to reboot the node:
    ansible-playbook reboot.yaml -i nautilus-ansible/nautilus-hosts.yaml -l <nodename>
Special Instructions to Reboot Ceph Nodes
To maintain redundancy in the Ceph cluster, only one node can be rebooted at a time.
Run this command to enter the rook-ceph-tools pod shell. Replace <namespace> with the appropriate Ceph cluster namespace (e.g., rook, rook-east, rook-pacific, rook-haosu, rook-suncave):
    kubectl exec -it -n <namespace> $(kubectl get pods -n <namespace> --selector=app=rook-ceph-tools --output=jsonpath='{.items..metadata.name}') -- bash
Once inside the pod shell, run:
    watch ceph health detail
Wait until [WRN] OSD_DOWN: 1 osds down disappears from the ceph health detail output before rebooting the next node.
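As an alternative to watching the output by hand, the wait can be scripted. This is a minimal sketch meant to run inside the rook-ceph-tools pod shell. The function name wait_for_osds_up and the POLL_SECONDS variable are hypothetical, and the loop checks only for the OSD_DOWN warning described above, not for any other health condition.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_osds_up and POLL_SECONDS are hypothetical names.
# Blocks until `ceph health detail` no longer reports OSD_DOWN.
wait_for_osds_up() {
  local interval="${POLL_SECONDS:-10}"
  # grep -q exits 0 while the OSD_DOWN warning is still present.
  while ceph health detail | grep -q 'OSD_DOWN'; do
    echo "OSDs still down, checking again in ${interval}s"
    sleep "${interval}"
  done
  echo "No OSD_DOWN warning, safe to reboot the next node"
}
```

Run wait_for_osds_up after each node comes back, and start the next reboot only once it returns. Other warnings in the health output are not inspected by this loop and should still be reviewed.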
Recycling a node
- If possible, run kubeadm reset on the node.
- Delete the node from the Kubernetes cluster.
- Delete the node from NetBox.
- Close all GitLab issues related to the node.
- Check whether any volumeattachments are left for the node in Kubernetes.
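The last check can be done with kubectl by listing VolumeAttachment objects and filtering on the node name. The sketch below assumes kubectl is configured for the cluster; the function name node_volumeattachments is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: node_volumeattachments is a hypothetical helper name.
# Lists VolumeAttachments whose spec.nodeName matches the given node.
node_volumeattachments() {
  local node="$1"
  # One row per attachment: name, node, persistent volume.
  kubectl get volumeattachments --no-headers \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName' |
    awk -v n="${node}" '$2 == n'
}
```

An empty result from node_volumeattachments <nodename> means nothing is left attached to the node. Any rows that remain identify the attachments to investigate.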

This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019.