Intro
One morning I woke up and tried to access my Gitea, only to find that it wasn’t running.
I checked my cluster and found that the whole thing was dead as meat. I quickly jumped in and ran k get pods -A to see what was going on. None of my services worked.
What immediately caught my eye were the 100+ pods of my fork_updater cronjob. The fork_updater cronjob, which is supposed to run once a month, looks like this:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: fork-updater
  namespace: fork-updater
spec:
  schedule: "* * 1 * *"
  jobTemplate:
    spec:
      template:
        spec:
          volumes:
            - name: fork-updater-ssh-key
              secret:
                secretName: fork-updater-ssh-key
                defaultMode: 256 # yaml spec does not support octal mode
          containers:
            - name: fork-updater
              imagePullPolicy: IfNotPresent
              image: skarlso/repo-updater:1.0.4
              env:
                - name: GIT_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: fork-updater-secret
                      key: GIT_TOKEN
              volumeMounts:
                - name: fork-updater-ssh-key
                  mountPath: "/etc/secret"
                  readOnly: true
          restartPolicy: OnFailure
Inherently there is nothing wrong with this at first glance. But on second glance, the problem is restartPolicy: OnFailure. A Job's pod template only accepts Never or OnFailure, and OnFailure means the kubelet keeps restarting the failing container in place.
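For the record, the stock CronJob and Job APIs already ship the knobs that would have contained this. Here is a sketch of a more defensive spec; the concrete numbers are arbitrary, pick your own:

spec:
  schedule: "* * 1 * *"
  concurrencyPolicy: Forbid      # don't start a new job while the previous one still exists
  startingDeadlineSeconds: 300   # skip a run entirely if it can't start in time
  failedJobsHistoryLimit: 3      # don't keep a graveyard of dead jobs around
  jobTemplate:
    spec:
      backoffLimit: 3            # stop retrying the job after 3 failures
      template:
        spec:
          restartPolicy: Never   # let the Job controller count failures instead of restarting in place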
For whatever reason, the cronjob died right as it started up. The restart policy then… restarted it, and it failed again
just as fast. Meanwhile the CronJob controller kept scheduling a new one and a new one and a new one (the default concurrencyPolicy is Allow, so new jobs happily pile up next to the failed ones). Note also that the schedule * * 1 * * doesn't fire once a month; it fires every minute of the first day of the month. Soon I had 100+ pods pending and running and
creating.
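In case you ever find yourself in the same spot: deleting the CronJob and its jobs also tears down their pods, so kubectl delete cronjob fork-updater -n fork-updater followed by kubectl delete jobs --all -n fork-updater should stop the bleeding (names as in the manifest above).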
At that point the cluster was basically DDoS'd into oblivion. Once the other resources started to die (since this was a private cluster, I hadn't bothered to set up restrictions on resources), the cronjob hogged even more and basically blocked everything else from being able to run. It overwhelmed the scheduler.
Lovely, that.
This is how you could potentially kill a cluster which doesn't have any resource limits and restrictions set up.
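If you want a safety net for exactly this failure mode, a namespace-level ResourceQuota caps how much a runaway workload can grab. A minimal sketch; the numbers are placeholders, pick ones that fit your cluster:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: fork-updater-quota
  namespace: fork-updater
spec:
  hard:
    pods: "5"               # hard cap on the number of pods in this namespace
    requests.cpu: "1"       # total CPU all pods in the namespace may request
    requests.memory: 512Mi  # total memory all pods in the namespace may request

The cpu/memory parts only bite if pods actually declare requests, so in practice you'd pair this with a LimitRange that injects defaults. The pods cap alone would have stopped the flood at five.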
Gergely.