Kubernetes/Helm commands
Install K3s
Install nvidia-container-toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Cheatsheet
Get Cluster Nodes
kubectl get nodes
Apply configuration
kubectl apply -f path/to/folder
Remove applied configuration from the cluster
kubectl delete -f path/to/folder
Kubectl error: the object has been modified; please apply your changes to the latest version and try again
Remove these server-generated metadata fields from the manifest, then apply it again:
creationTimestamp:
resourceVersion:
selfLink:
uid:
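The manual cleanup above can also be scripted. This is a minimal sketch, assuming the manifest is YAML with one field per line; `strip_server_fields` is a hypothetical helper name, not a kubectl feature. It deletes every line that starts with one of those keys, including a `uid:` nested under `ownerReferences`, so review the result before applying it.

```shell
# Hypothetical helper: print a manifest without the server-generated
# metadata fields listed above. Reads the file given as $1, writes to stdout.
strip_server_fields() {
  sed -E '/^[[:space:]]*(creationTimestamp|resourceVersion|selfLink|uid):/d' "$1"
}

# Usage (file names are placeholders):
#   strip_server_fields exported.yaml > clean.yaml
#   kubectl apply -f clean.yaml
```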
Shutdown k3s node properly
Drain all pods from the node
kubectl drain <node-name> --ignore-daemonsets
## Force drain: deletes emptyDir data, bypasses PodDisruptionBudgets, and removes unmanaged pods
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --disable-eviction --force
Make the node schedulable again
kubectl uncordon <node-name>
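To avoid retyping the two halves of a maintenance window, the drain and uncordon steps can be wrapped in one function. This is only a sketch: `node_maintenance` and its `start`/`end` arguments are hypothetical names, and it uses the plain drain, not the forced variant.

```shell
# Hypothetical wrapper around the drain / uncordon steps above.
# Usage: node_maintenance <node-name> start|end
node_maintenance() {
  node="$1"
  action="$2"
  case "$action" in
    start)
      # Cordon the node and evict its pods (DaemonSet pods are left in place).
      kubectl drain "$node" --ignore-daemonsets
      ;;
    end)
      # Allow new pods to be scheduled on the node again.
      kubectl uncordon "$node"
      ;;
    *)
      echo "usage: node_maintenance <node-name> start|end" >&2
      return 1
      ;;
  esac
}
```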
K3s Debug
Check logs
journalctl -u k3s
# Show only last 100 lines
journalctl -u k3s -n 100
# Print the last 100 lines directly to the terminal, without a pager
journalctl -u k3s -n 100 --no-pager
Restart k3s
sudo systemctl restart k3s
Copy files
Copy all files in a folder to a pod
for file in ./*; do kubectl cp "$file" "namespace/podName:/path/to/folder/$(basename "$file")"; done
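The loop above can be turned into a reusable function. This is a sketch under assumptions: `copy_folder_to_pod` and its argument order are hypothetical, and unlike the one-liner it skips subdirectories. Note that `kubectl cp` needs `tar` to be available inside the container.

```shell
# Hypothetical helper: copy every regular file in a local folder into a pod.
# Usage: copy_folder_to_pod <local-dir> <namespace> <pod> <remote-dir>
copy_folder_to_pod() {
  src="$1"; ns="$2"; pod="$3"; dest="$4"
  for file in "$src"/*; do
    # Skip subdirectories, and the unexpanded glob when the folder is empty.
    [ -f "$file" ] || continue
    kubectl cp "$file" "$ns/$pod:$dest/$(basename "$file")"
  done
}

# Usage (all names are placeholders):
#   copy_folder_to_pod . namespace podName /path/to/folder
```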
Handle Longhorn faulted volumes
This can happen when the node is under resource pressure (for example disk or memory pressure). Scale down the deployments on that node, then fix the underlying issue.
Stuck in faulted and detaching state
# List faulted volumes
kubectl get volumes.longhorn.io -n longhorn-system | grep -i faulted
# get longhorn logs
kubectl logs -n longhorn-system -l app=longhorn-manager
## Manager is not running
# Check if the manager pod is running
kubectl get pods -n longhorn-system -o wide | grep mynodename
## Set replicas to 0
# List the pods running on the node to see which namespaces are affected
kubectl get pods --all-namespaces --field-selector spec.nodeName=mynodename
# Scale all deployments in a namespace down to 0
kubectl scale deployment --all --replicas=0 -n mynamespace
# Scale all deployments in a namespace back up to 1
kubectl scale deployment --all --replicas=1 -n mynamespace
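When several volumes are affected, it helps to pull out just their names from the listing. A minimal sketch, assuming the volume name is the first column of the `kubectl get volumes.longhorn.io` output; `faulted_volume_names` is a hypothetical filter, and it matches any row containing "faulted" regardless of column.

```shell
# Hypothetical filter: read `kubectl get volumes.longhorn.io` output on stdin
# and print the name (first column) of every row that mentions "faulted".
# The header row is skipped.
faulted_volume_names() {
  awk 'NR > 1 && tolower($0) ~ /faulted/ { print $1 }'
}

# Usage:
#   kubectl get volumes.longhorn.io -n longhorn-system | faulted_volume_names
```

Each printed name can then be inspected with `kubectl describe volumes.longhorn.io <name> -n longhorn-system`.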
CNI locked IP
In my case, the CNI plugin kept pod IP addresses reserved after the pods were gone, so they were never released. To resolve this, I had to delete the reserved IPs.
# Verify the number of reserved IPs
sudo ls /var/lib/cni/networks/cbr0/ | grep "10.42" | wc -l
# Stop k3s (which runs the kubelet) temporarily
sudo systemctl stop k3s.service
# Backup and clean the CNI network state
sudo cp -r /var/lib/cni/networks/cbr0/ /var/lib/cni/networks/cbr0.backup
sudo rm -f /var/lib/cni/networks/cbr0/10.42.0.*
sudo rm -f /var/lib/cni/networks/cbr0/last_reserved_ip.0
# Make sure the lock file exists, so it is the only file left
sudo touch /var/lib/cni/networks/cbr0/lock
# Start k3s again
sudo systemctl start k3s.service
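The backup and cleanup steps above can be grouped into one function so the directory and IP prefix are only typed once. This is a sketch, not a tested recovery tool: `clean_cni_state` is a hypothetical name, and unlike the raw commands it refuses to run if a backup directory already exists.

```shell
# Hypothetical helper: back up a CNI state directory, then remove the
# reserved-IP files matching a prefix. Run it only while k3s is stopped.
# Usage: clean_cni_state <state-dir> <ip-prefix>
clean_cni_state() {
  dir="${1%/}"   # strip a trailing slash so the backup lands next to the dir
  prefix="$2"
  backup="$dir.backup"

  # Refuse to overwrite an earlier backup.
  if [ -e "$backup" ]; then
    echo "backup already exists: $backup" >&2
    return 1
  fi
  cp -r "$dir" "$backup"

  # Remove the per-IP reservation files and the last-reserved marker.
  rm -f "$dir/$prefix"* "$dir/last_reserved_ip.0"

  # Make sure the lock file is still present.
  touch "$dir/lock"
}

# Usage from a root shell (sudo cannot call a shell function directly):
#   systemctl stop k3s.service
#   clean_cni_state /var/lib/cni/networks/cbr0 10.42.0.
#   systemctl start k3s.service
```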
Delete a stuck namespace
kubectl get namespace "funky-elephant" -o json | jq 'del(.spec.finalizers)' | kubectl replace --raw /api/v1/namespaces/funky-elephant/finalize -f -
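Clearing the finalizers forces the namespace away and can leave orphaned resources behind, so it may be worth checking first what is still inside it. A minimal sketch, assuming the API server is reachable; `list_namespace_leftovers` is a hypothetical helper name.

```shell
# Hypothetical helper: list every namespaced resource still present in a
# namespace, to see what is blocking its deletion.
# Usage: list_namespace_leftovers <namespace>
list_namespace_leftovers() {
  ns="$1"
  # Ask the API server for every namespaced resource type that can be listed,
  # then query each type in the target namespace.
  kubectl api-resources --verbs=list --namespaced -o name |
    while read -r kind; do
      kubectl get "$kind" -n "$ns" --ignore-not-found -o name < /dev/null
    done
}

# Usage:
#   list_namespace_leftovers funky-elephant
```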