One morning I woke up and tried to access my Gitea, only to find that it wasn't running. I checked my cluster and found that the whole thing was dead as a doornail. I quickly jumped in and ran `k get pods -A` to see what was going on. None of my services were working.
What immediately caught my eye was the 100+ pods of my fork_updater cronjob. The fork_updater cronjob, which runs once a month, looks like this:
```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: fork-updater
  namespace: fork-updater
spec:
  schedule: "* * 1 * *"
  jobTemplate:
    spec:
      template:
        spec:
          volumes:
            - name: fork-updater-ssh-key
              secret:
                secretName: fork-updater-ssh-key
                defaultMode: 256 # yaml spec does not support octal mode
          containers:
            - name: fork-updater
              imagePullPolicy: IfNotPresent
              image: skarlso/repo-updater:1.0.4
              env:
                - name: GIT_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: fork-updater-secret
                      key: GIT_TOKEN
              volumeMounts:
                - name: fork-updater-ssh-key
                  mountPath: "/etc/secret"
                  readOnly: true
          restartPolicy: OnFailure
```
At first glance there is nothing inherently wrong with this. On a second glance, though, the problem reveals itself. For whatever reason, the cronjob died when it started up. The restart policy then… restarted the cronjob, which failed again really fast. Then the controller scheduled a new job, and a new one, and a new one… and I had 100+ containers pending and running.
At that point the cluster was basically DDoS'd into oblivion. Once the other workloads started to die (since this was a private cluster, I hadn't bothered to set up any resource restrictions), the cronjob hogged even more resources and basically blocked everything else from being able to run. It overwhelmed the scheduler.
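Digging out of a pile-up like this roughly comes down to removing the source first, so nothing new gets scheduled, and then sweeping up the jobs that are already there. A minimal sketch, using the names from the manifest above:

```bash
# Remove the CronJob first so no new jobs get created
kubectl delete cronjob fork-updater -n fork-updater

# Then delete the jobs it already spawned; their pods are cleaned up with them
kubectl delete jobs --all -n fork-updater
```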
This is how you can potentially kill a cluster that doesn't have any resource limits or restrictions set up.
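To keep this from happening again, a few safeguards go a long way. The snippet below is only a sketch with illustrative numbers, not the exact config I run: `concurrencyPolicy: Forbid` stops jobs from piling up next to each other, `backoffLimit` and `activeDeadlineSeconds` cap how long a broken job can flail, the history limits keep finished jobs from accumulating, and resource requests/limits together with a namespace ResourceQuota stop a single workload from starving the scheduler. (As an aside, `* * 1 * *` actually fires every minute of every hour on the 1st of the month; `0 0 1 * *` is the once-a-month schedule. The example uses batch/v1, which has since replaced batch/v1beta1, and trims the env and volume sections for brevity.)

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: fork-updater
  namespace: fork-updater
spec:
  schedule: "0 0 1 * *"           # once a month, at midnight on the 1st
  concurrencyPolicy: Forbid       # never start a new job while one is still around
  startingDeadlineSeconds: 300    # skip the run entirely if it can't start in time
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 3             # stop retrying after a few failures
      activeDeadlineSeconds: 600  # kill the job if it runs longer than 10 minutes
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: fork-updater
              image: skarlso/repo-updater:1.0.4
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 500m
                  memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: fork-updater-quota
  namespace: fork-updater
spec:
  hard:
    pods: "5"                     # hard cap on pods in this namespace
    limits.cpu: "1"
    limits.memory: 512Mi
```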