In some scenarios, the drone-agent cannot notify the drone-server of the build status

I have run into a strange problem.
The Drone UI shows a build record that apparently was never executed.

It looks as if the record was never consumed by drone-agent. In fact, it was consumed by the drone-agent and the build completed (after the build, the Docker image was generated normally).

I checked the logs of drone-server and drone-agent. They show that drone-agent passed an ID of zero when notifying drone-server of the status change.

{"time":"2018-10-19T03:02:09Z","level":"debug","repo":"mbrand/service_acl","build":"62","id":"0","message":"pipeline lease renewed"}
grpc error: extend(): code: Unknown: rpc error: code = Unknown desc = queue: task not found

As the logs above show, the drone-agent uses ID 0 to communicate with drone-server. I then read the Drone source code; the following is part of the build trigger process (func PostHook() in https://github.com/drone/drone/blob/master/server/hook.go).

func PostHook(c *gin.Context) {
....

err = store.CreateBuild(c, build, build.Procs...)
if err != nil {
	logrus.Errorf("failure to save commit for %s. %s", repo.FullName, err)
	c.AbortWithError(500, err)
	return
}

c.JSON(200, build)

if build.Status == model.StatusBlocked {
	return
}

b := builder{
	Repo:  repo,
	Curr:  build,
	Last:  last,
	Netrc: netrc,
	Secs:  secs,
	Regs:  regs,
	Envs:  envs,
	Link:  httputil.GetURL(c.Request),
	Yaml:  conf.Data,
}
items, err := b.Build()
if err != nil {
	build.Status = model.StatusError
	build.Started = time.Now().Unix()
	build.Finished = build.Started
	build.Error = err.Error()
	store.UpdateBuild(c, build)
	return
}

var pcounter = len(items)

for _, item := range items {
	build.Procs = append(build.Procs, item.Proc)
	item.Proc.BuildID = build.ID

	for _, stage := range item.Config.Stages {
		var gid int
		for _, step := range stage.Steps {
			pcounter++
			if gid == 0 {
				gid = pcounter
			}
			proc := &model.Proc{
				BuildID: build.ID,
				Name:    step.Alias,
				PID:     pcounter,
				PPID:    item.Proc.PID,
				PGID:    gid,
				State:   model.StatusPending,
			}
			build.Procs = append(build.Procs, proc)
		}
	}
}
err = store.FromContext(c).ProcCreate(build.Procs)
if err != nil {
	logrus.Errorf("error persisting procs %s/%d: %s", repo.FullName, build.Number, err)
}

Looking at the code, as soon as

err = store.CreateBuild

succeeds, the handler tells the client that the request succeeded (c.JSON(200, build)). But if the later call

err = store.FromContext(c).ProcCreate(build.Procs)

fails, meaning the database operation failed, the error is only logged and the primary key ID of each proc stays 0.
As a result, the drone-agent consumes queue data from drone-server that carries the wrong ID, which causes drone-agent to notify drone-server of the build status with that wrong ID.
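To illustrate why a failed insert would leave the ID at 0, here is a minimal sketch. It assumes the store assigns primary keys only when the insert succeeds; the proc type and the insertProcs and unsavedProcs helpers are hypothetical stand-ins, not Drone's actual model or store code.

```go
package main

import "errors"

// proc mirrors only the fields relevant here (hypothetical, not model.Proc).
type proc struct {
	ID   int64
	Name string
}

// insertProcs is a hypothetical stand-in for a batch insert.
// It assigns IDs only on success; on failure the structs are left
// untouched, so every ID keeps Go's zero value for int64, which is 0.
func insertProcs(procs []*proc, fail bool) error {
	if fail {
		return errors.New("insert failed")
	}
	for i, p := range procs {
		p.ID = int64(i + 1)
	}
	return nil
}

// unsavedProcs returns the names of procs that still have ID 0,
// i.e. procs that were never persisted.
func unsavedProcs(procs []*proc) []string {
	var names []string
	for _, p := range procs {
		if p.ID == 0 {
			names = append(names, p.Name)
		}
	}
	return names
}
```

If the error from the insert is only logged, code that runs afterwards still sees these procs and would pass ID 0 along, which matches the "id":"0" in the agent log.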

So I would like to ask: in the PostHook function, why not put

c.JSON(200, build)

after

err = store.FromContext(c).ProcCreate(build.Procs)
if err != nil {
    logrus.Errorf("error persisting procs %s/%d: %s", repo.FullName, build.Number, err)
}

and handle the error, as in the code below?

  err = store.FromContext(c).ProcCreate(build.Procs)
  if err != nil {
    // Mark the build as failed so it does not stay pending in the UI.
    build.Status = model.StatusError
    build.Error = err.Error()
    store.UpdateBuild(c, build)
    logrus.Errorf("error persisting procs %s/%d: %s", repo.FullName, build.Number, err)
    c.String(500, "Failed to create procs %s/%d: %s", repo.FullName, build.Number, err)
    return
  }

  // Report success only after the procs have been persisted.
  c.JSON(200, build)
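The ordering I am proposing can be sketched with only the standard library. This is a simplified illustration, not Drone's code: procStore, okStore, failingStore, hookHandler and statusFor are hypothetical names, and the real handler uses gin rather than net/http.

```go
package main

import (
	"errors"
	"net/http"
	"net/http/httptest"
)

// procStore is a hypothetical stand-in for the part of the store
// that persists procs; it is not Drone's real interface.
type procStore interface {
	ProcCreate(names []string) error
}

// okStore always succeeds.
type okStore struct{}

func (okStore) ProcCreate([]string) error { return nil }

// failingStore simulates a database error.
type failingStore struct{}

func (failingStore) ProcCreate([]string) error { return errors.New("db write failed") }

// hookHandler persists the procs first and reports success afterwards,
// so the caller never receives a 200 for procs that were not saved.
func hookHandler(s procStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		procs := []string{"clone", "build"}
		if err := s.ProcCreate(procs); err != nil {
			http.Error(w, "failed to create procs: "+err.Error(), http.StatusInternalServerError)
			return // stop here: nothing should be queued for the agent
		}
		w.WriteHeader(http.StatusOK)
	}
}

// statusFor runs the handler once against an in-memory recorder and
// returns the HTTP status code the webhook sender would receive.
func statusFor(s procStore) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/hook", nil)
	hookHandler(s).ServeHTTP(rec, req)
	return rec.Code
}
```

With this ordering, statusFor(okStore{}) yields 200 and statusFor(failingStore{}) yields 500. One thing I am unsure about: in the real PostHook, the earlier return paths (a blocked build, or an error from b.Build()) would also need to send their own response if c.JSON(200, build) is moved down.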

If my reasoning is wrong, please tell me why. If I am right, may I submit a pull request to the official repo?
