Instance or Docker Autoscaler host unexpectedly removed, causing job failure
Description
- A job running on a self-managed runner that uses the instance or Docker Autoscaler executor fails with:
ERROR: Job failed (system failure)
- The runner log shows:
ERROR: instance unexpectedly removed
Environment
Impacted offerings:
- GitLab.com
- GitLab Dedicated
- GitLab Self-Managed
Solution
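On the runner host, check for orphaned fleeting plugin processes (plugin processes whose parent PID is 1) and kill them.
For example, in the following output, PIDs 37093 and 37530 have a PPID of 1 and must be stopped. PIDs 38094 and 38173 belong to the running runner process (PPID 38085) and must be left alone: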
# ps -ef | grep -E "plugin|PPID" | grep -v grep
UID PID PPID C STIME TTY TIME CMD
root 37093 1 1 05:32 pts/0 00:00:27 /root/.config/fleeting/plugins/registry.gitlab.com/gitlab-org/fleeting/plugins/googlecloud/1.0.0/plugin
root 37530 1 1 05:50 pts/0 00:00:19 /root/.config/fleeting/plugins/registry.gitlab.com/gitlab-org/fleeting/plugins/googlecloud/1.0.0/plugin
root 38094 38085 4 06:08 pts/0 00:00:18 /root/.config/fleeting/plugins/registry.gitlab.com/gitlab-org/fleeting/plugins/googlecloud/1.0.0/plugin
root 38173 38085 4 06:05 pts/0 00:00:17 /root/.config/fleeting/plugins/registry.gitlab.com/gitlab-org/fleeting/plugins/googlecloud/1.0.0/plugin
# kill 37093 37530
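The manual check above can be sketched as a small script. This is only an illustration: the function name find_orphaned_plugins is hypothetical, and the script assumes that plugin binaries live under a path containing fleeting/plugins/, as in the example output. It only lists candidate PIDs, so the list can be reviewed before anything is killed.

```shell
#!/bin/sh
# Sketch: list fleeting plugin processes whose parent PID is 1.
# Assumption: plugin binaries are installed under a path that contains
# "fleeting/plugins/", as in the example output above.

# Hypothetical helper. Reads "PID PPID COMMAND..." lines on stdin and
# prints the PID of every line with PPID 1 and a matching command path.
find_orphaned_plugins() {
    awk '$2 == 1 && $3 ~ /fleeting\/plugins\// { print $1 }'
}

# The trailing "=" on each column name suppresses the header line.
ps -eo pid=,ppid=,args= | find_orphaned_plugins
```

If the listed PIDs are confirmed to be orphans, pass them to kill as in the example. If the runner process itself has PID 1 on the host, for example inside a container, a PPID of 1 does not by itself indicate an orphan, so check the list first.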
Cause
When a runner configured with the instance or Docker Autoscaler executor is started, a fleeting plugin process of the required type is started for each executor. These processes are responsible for issuing requests to the cloud provider to scale instances up and down.
When a runner is stopped, any associated plugin processes should also be stopped. Occasionally, however, a plugin process is left running in an orphaned state. It can then interfere with newly started plugin processes and cause unexpected scaling events, such as removing an instance that is still running a job.
An issue has been created proposing the runner be enhanced to prevent this situation.