Skipping DAG steps in highly parallel builds with Error response from daemon: unauthorized

We are seeing builds fail or skip steps.

This happens on highly parallel builds, where we see Error response from daemon: unauthorized while the build is executing. Other build steps in the pipeline do run on this agent.

We see this in the UI.

Let me know if I can provide any further info.

Error response from daemon: unauthorized

This error indicates the image did not exist in the local Docker cache, so the runner made an API request to the Docker daemon on the host to pull the image. The Docker daemon responded to this API call with an unauthorized error, which means the registry rejected the request to pull the image. I recommend checking the Docker daemon debug logs to see whether Docker provides more detail on why the registry rejected the pull.

The reason the step was skipped, I presume, is that the pipeline is in a failed state. The pipeline would be in a failed state because one of its steps failed due to an inability to pull the Docker image. When the pipeline is in a failed state, the remaining steps are skipped by default unless they were explicitly configured to execute on failure using when: { status: [ failure ] }.
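For example, a step can be configured to run even when an earlier step has failed. This is a minimal sketch only; the notify step, the plugins/slack image, and the slack_webhook secret are placeholders, not something taken from your pipeline:

```yaml
kind: pipeline
type: docker
name: default

steps:
- name: build
  image: golang:1.14
  commands:
  - go build ./...

# placeholder step: runs whether the pipeline succeeds or fails
- name: notify
  image: plugins/slack
  settings:
    webhook:
      from_secret: slack_webhook
  when:
    status:
    - success
    - failure
```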

So, based on this screenshot, the root cause seems to be a problem pulling a private image from a registry. I hope that helps.


Definitely. Sadly, it pulled the same image about 25 other times.

But that gives us a good place to look. I wonder if the registry rejected the pull after being asked so many times, or if the repo did.

This is certainly possible. One way to mitigate such issues is to use pull: if-not-exists [1], which instructs the system to always use the Docker image in the local cache if it exists. If a private registry is rate limited or becomes unstable under heavy load, this can reduce the number of requests and the overall load.

[1] https://docs.drone.io/pipeline/docker/syntax/images/#pulling-images
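For reference, this is roughly what that looks like on a pipeline step. A minimal sketch only; the step name and command are placeholders, and the image is taken from your daemon events below:

```yaml
steps:
- name: deploy
  image: image/path/kube-drone-deploy:4
  # reuse the locally cached image instead of contacting the registry on every step
  pull: if-not-exists
  commands:
  - ./deploy.sh   # placeholder command
```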

Looking at the events on the daemon, they seem pretty clear.

```
2020-09-24T13:07:06.168910450Z image pull image/path/kube-drone-deploy:4 (name=image/path/kube-drone-deploy, org.label-schema.build-date=2020-09-23T14:39:00Z, org.label-schema.schema-version=1.0, org.label-schema.vcs-ref=b1973e92478cf60b5f94f9c34bfc9cdb6382185d, org.label-schema.vcs-url=https://github.com/sqsp/kube-drone-deploy.git)
2020-09-24T13:07:06.171544010Z image pull image/path/kube-drone-deploy:4 (name=image/path/kube-drone-deploy, org.label-schema.build-date=2020-09-23T14:39:00Z, org.label-schema.schema-version=1.0, org.label-schema.vcs-ref=b1973e92478cf60b5f94f9c34bfc9cdb6382185d, org.label-schema.vcs-url=https://github.com/sqsp/kube-drone-deploy.git)
2020-09-24T13:07:06.171636221Z image pull image/path/kube-drone-deploy:4 (name=image/path/kube-drone-deploy, org.label-schema.build-date=2020-09-23T14:39:00Z, org.label-schema.schema-version=1.0, org.label-schema.vcs-ref=b1973e92478cf60b5f94f9c34bfc9cdb6382185d, org.label-schema.vcs-url=https://github.com/sqsp/kube-drone-deploy.git)
2020-09-24T13:07:06.189321842Z image pull image/path/kube-drone-deploy:4 (name=image/path/kube-drone-deploy, org.label-schema.build-date=2020-09-23T14:39:00Z, org.label-schema.schema-version=1.0, org.label-schema.vcs-ref=b1973e92478cf60b5f94f9c34bfc9cdb6382185d, org.label-schema.vcs-url=https://github.com/sqsp/kube-drone-deploy.git)
2020-09-24T13:07:06.201491115Z volume create drone-Sjbi8KPofEftxrmcJoHf (driver=local)
2020-09-24T13:07:06.205448101Z volume create drone-Sjbi8KPofEftxrmcJoHf (driver=local)
2020-09-24T13:07:06.211156679Z volume create drone-Sjbi8KPofEftxrmcJoHf (driver=local)
2020-09-24T13:07:06.214519292Z volume create drone-Sjbi8KPofEftxrmcJoHf (driver=local)
```

It then proceeds to bring up and stop lots of containers. The same goes for Artifactory: we don’t see any pull errors in its logs.

Could this be occurring for any other reason? Are there any other logs I can grab?

We know for sure that Error response from daemon: unauthorized comes from Docker trying to pull the image from Artifactory. Unfortunately we only know what Docker tells us, which is that the pull failed with an unauthorized error. The only root cause I am aware of that would result in an unauthorized error is missing credentials, or credentials with insufficient permissions. I recommend contacting the Artifactory support folks, who should be able to help triage image pull issues, including how to enable more verbose logging to understand where this error comes from at the daemon or registry layer.
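In case it is useful while checking the credential angle, this is a rough sketch of how registry credentials are typically attached to a Docker pipeline so the runner can pull private images. The dockerconfigjson secret name is only an example and must match a secret you have actually created, whose value is a Docker config.json containing the Artifactory credentials:

```yaml
kind: pipeline
type: docker
name: default

steps:
- name: deploy
  # private image hosted in Artifactory
  image: image/path/kube-drone-deploy:4
  commands:
  - ./deploy.sh   # placeholder command

# example secret name; its value is a docker config.json
# with auth for the private registry
image_pull_secrets:
- dockerconfigjson
```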

As an aside, I recall another team having issues with Artifactory, and they were able to trace the issue back to the Artifactory server being overwhelmed:

It seems that setting proxy_max_temp_file_size 0; in our nginx conf that sits in front of artifactory has solved the problem. For very large docker images, nginx was caching all the layers to disk before sending them in the response, which would cause timeouts and errors.

Not sure if this is related but may be worth looking into.

Absolutely. We already have that set in front of Artifactory. I just wish we could see an error in the Docker events or in Artifactory. This has to be coming from the daemon, though. I wonder what is causing it.