diff --git a/TROUBLESHOOTING.md b/TROUBLESHOOTING.md
index 521f5591..697c1735 100644
--- a/TROUBLESHOOTING.md
+++ b/TROUBLESHOOTING.md
@@ -290,3 +290,13 @@ HRA doesn't scale the RunnerDeployment to zero, even though you did configure HR
 You very likely have some dangling workflow jobs stuck in `queued` or `in_progress` as seen in [#1057](https://github.com/actions-runner-controller/actions-runner-controller/issues/1057#issuecomment-1133439061).
 
 Manually call [the "list workflow runs" API](https://docs.github.com/en/rest/actions/workflow-runs#list-workflow-runs-for-a-repository), and [remove the dangling workflow job(s)](https://docs.github.com/en/rest/actions/workflow-runs#delete-a-workflow-run).
+
+## Slow or failed boot of the dind sidecar (default runner)
+
+**Problem**
+
+If you notice that the dind sidecar container takes several minutes to be created, or that it exits with an error right after being created, you may be experiencing a disk performance issue. You might also see the message `failed to reserve container name` when scaling up multiple runners at once. SSH into the Kubernetes node that the problematic pods were scheduled on and use tools like `atop`, `htop`, or `iotop` to check disk utilization and the percentage of CPU time spent in iowait (see the sketch at the end of this section). If disk usage is high (80-100%) and iowait takes up a significant share of CPU time (it should normally stay below 10%), performance is being bottlenecked by a slow disk.
+
+**Solution**
+
+The solution is to switch to faster storage. If you are experiencing this issue you are most likely running on HDDs; switching to SSDs fixed the problem in my case. Most cloud providers offer a range of storage options, so pick something faster than your current disk type. For on-prem clusters you will need to invest in SSDs.
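+
+To confirm that the disk is in fact the bottleneck, a minimal check on the affected node might look like the sketch below. It assumes the `sysstat` package (which provides `iostat`) is installed; exact column names can vary by version:
+
+```bash
+# Per-device I/O statistics: 5 samples, 1 second apart.
+# A %util column pinned near 100% indicates a saturated disk.
+iostat -dx 1 5
+
+# The "wa" field in top's CPU summary line is the percentage of CPU time
+# spent waiting on I/O; sustained values well above 10% point at slow storage.
+top -bn1 | head -n 5
+```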
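+
+As an illustration of moving to faster storage on a managed cloud, a GKE node pool backed by SSD persistent disks could be created as shown below; the pool and cluster names are placeholders, and other providers expose equivalent disk-type options:
+
+```bash
+# Illustration only: back node boot disks with pd-ssd instead of an
+# HDD-backed type such as pd-standard.
+gcloud container node-pools create ssd-runner-pool \
+  --cluster my-cluster \
+  --disk-type pd-ssd
+```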