Commands kubernetes/helm

Install K3s

Install nvidia-container-toolkit

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Cheatsheet

Get Cluster Nodes

kubectl get nodes

Apply conf

kubectl apply -f path/to/folder

Remove applied conf from cluster

kubectl delete -f path/to/folder

Kubectl error: the object has been modified; please apply your changes to the latest version and try again

Remove these lines from the file:

creationTimestamp:
resourceVersion:
selfLink:
uid:
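
The cleanup above can also be scripted. This is a minimal sketch, assuming the manifest is YAML with one field per line. `strip_server_fields` is a hypothetical helper name, and because the filter is line-based rather than YAML-aware, it would also drop a `uid:` line nested under `ownerReferences`.

```shell
# strip_server_fields: read a manifest on stdin and drop the server-managed
# metadata lines listed above, writing the rest to stdout.
# Assumption: one YAML field per line; this is a plain text filter.
strip_server_fields() {
  grep -vE '^[[:space:]]*(creationTimestamp|resourceVersion|selfLink|uid):'
}

# Hypothetical usage: clean an exported manifest before re-applying it.
# kubectl get deployment my-app -o yaml | strip_server_fields > my-app.yaml
```

Review the cleaned file before applying it, since the filter does not check where in the document a field appears.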


Shutdown k3s node properly

Drain all pods from the node

kubectl drain <node-name> --ignore-daemonsets
# Force drain: deletes unmanaged pods and emptyDir data, and bypasses PodDisruptionBudgets
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --disable-eviction --force

Make the node schedulable again

kubectl uncordon <node-name>
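
The two drain variants above can be combined so that the forced drain only runs when the graceful one fails. This is a sketch under assumptions: `drain_node` is a hypothetical helper name, and the 120 second value passed to `--timeout` is an arbitrary choice, not a recommendation.

```shell
# drain_node NODE: try a graceful drain first, fall back to the forced one.
# Assumes kubectl is on PATH and already points at the cluster.
drain_node() {
  node="$1"
  # Graceful attempt; give up after 120s (arbitrary value) instead of hanging.
  if kubectl drain "$node" --ignore-daemonsets --timeout=120s; then
    return 0
  fi
  echo "graceful drain of $node failed, forcing" >&2
  # Same forced drain as in the cheatsheet entry above.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --disable-eviction --force
}

# Hypothetical usage:
# drain_node my-node && echo "safe to shut down" && kubectl uncordon my-node   # uncordon after maintenance
```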

K3s Debug

Check logs

journalctl -u k3s
# Show only last 100 lines
journalctl -u k3s -n 100
# Print the last 100 lines without a pager
journalctl -u k3s -n 100 --no-pager

Restart k3s

sudo systemctl restart k3s

Copy files

Copy all files in folder to pod

for file in ./*; do kubectl cp "$file" "namespace/podName:/path/to/folder/$(basename "$file")"; done

Handle Longhorn faulted volumes

A volume can become faulted when its node is under resource pressure (for example disk or memory). Scale down the deployments running on that node, then fix the underlying issue.

Stuck in faulted and detaching state

# List faulted volumes
kubectl get volumes.longhorn.io -n longhorn-system | grep -i faulted
# get longhorn logs
kubectl logs -n longhorn-system -l app=longhorn-manager


## Manager is not running

# Check if the manager pod is running
kubectl get pods -n longhorn-system -o wide | grep mynodename



## Set replicas to 0

# List the pods running on the node
kubectl get pods --all-namespaces --field-selector spec.nodeName=mynodename


# Scale all deployments in a namespace down to 0
kubectl scale deployment --all --replicas=0 -n mynamespace

# Scale all deployments in the namespace back up to 1
kubectl scale deployment --all --replicas=1 -n mynamespace
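
Scaling everything back to 1 loses the original replica counts of deployments that ran more than one replica. A possible workaround is to record the counts before scaling down and replay them afterwards. This is only a sketch: `save_replicas`, `restore_replicas`, and the file path in the usage lines are hypothetical names.

```shell
# save_replicas NAMESPACE FILE: write one "name replicas" line per deployment.
save_replicas() {
  kubectl get deployments -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{"\n"}{end}' > "$2"
}

# restore_replicas NAMESPACE FILE: scale each deployment back to its saved count.
restore_replicas() {
  while read -r name replicas; do
    # Redirect stdin so kubectl cannot consume the rest of the file.
    kubectl scale deployment "$name" --replicas="$replicas" -n "$1" < /dev/null
  done < "$2"
}

# Hypothetical usage:
# save_replicas mynamespace /tmp/mynamespace.replicas
# kubectl scale deployment --all --replicas=0 -n mynamespace
# restore_replicas mynamespace /tmp/mynamespace.replicas
```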

CNI locked IP

In my case, the CNI plugin kept pod IP addresses reserved instead of releasing them. To resolve this, I had to delete the stale reservations.

# Verify the number of reserved IPs
sudo ls /var/lib/cni/networks/cbr0/ | grep "10.42" | wc -l

# Stop k3s (which runs the kubelet) temporarily
sudo systemctl stop k3s.service

# Backup and clean the CNI network state
sudo cp -r /var/lib/cni/networks/cbr0/ /var/lib/cni/networks/cbr0.backup
sudo rm -f /var/lib/cni/networks/cbr0/10.42.0.*
sudo rm -f /var/lib/cni/networks/cbr0/last_reserved_ip.0

# Make sure the lock file still exists
sudo touch /var/lib/cni/networks/cbr0/lock

# Start k3s again
sudo systemctl start k3s.service
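
To check whether reservations are piling up before and after the cleanup, the count step above can be wrapped in a small helper. This is a sketch that assumes each reservation is a file named after its IPv4 address, as in the `cbr0` directory used above. `count_reserved_ips` is a hypothetical name, and reading the real directory may need a root shell, as the `sudo` in the commands above suggests.

```shell
# count_reserved_ips [DIR]: count files whose name looks like an IPv4 address.
# Assumption: one file per reserved IP; lock and last_reserved_ip.* are skipped.
count_reserved_ips() {
  dir="${1:-/var/lib/cni/networks/cbr0}"
  count=0
  for f in "$dir"/*; do
    case "$(basename "$f")" in
      *[!0-9.]*) ;;                          # contains non-digit, non-dot characters
      *.*.*.*) count=$((count + 1)) ;;       # dotted numeric name, treated as an IP
    esac
  done
  echo "$count"
}

# Hypothetical usage, from a root shell:
# count_reserved_ips
```

If the count is far higher than the number of pods actually running on the node, stale reservations are a likely cause.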

Delete a stuck namespace

kubectl get namespace "funky-elephant" -o json | jq 'del(.spec.finalizers)' | kubectl replace --raw /api/v1/namespaces/funky-elephant/finalize -f -
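
The same one-liner can be wrapped in a function so the namespace name is only typed once. This is a sketch: `force_finalize_namespace` is a hypothetical name, and it assumes `kubectl` and `jq` are installed. Clearing the finalizers skips whatever cleanup they were waiting for, so it is best kept as a last resort.

```shell
# force_finalize_namespace NAMESPACE: clear spec.finalizers and submit the
# result to the namespace's finalize endpoint, as in the command above.
force_finalize_namespace() {
  ns="$1"
  kubectl get namespace "$ns" -o json \
    | jq 'del(.spec.finalizers)' \
    | kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f -
}

# Hypothetical usage:
# force_finalize_namespace funky-elephant
```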