Triggered Builds by cron have high deviation against expected time

Hello,

We are using drone-server and drone-docker-runner to automate some test.
We are using cron feature within a @every 10m expression, so, we expect that each build is being executed every 10m since the first execution.

As you can see on the table below, with 25 consecutive builds, we have encountered the following:

  • The started build time is different than expected and seems to be increasing on each execeution
  • The Jitter (difference between real and expected time on consecutive executions)
    are very high, up to 16% from the expected one.

We have not seen anything relevant to cron in our logs (trace enabled on server and runner) .

  • Is there any way to debug to check if there is some problem on cron schedule?

Table with last 25 builds, scheduled by cron with @every 10m expression

Build Cron Real Started Build time Expected Started Build time Difference between Real and Expected time (in seconds) Real time difference betweeen consecutive executions (s) Expected time difference betweeen consecutive executions (s) Jitter (s) Jitter(%)
1 every_10m 09/01/2020 08:13:14 09/01/2020 08:13:14 0 0 0 0 0%
2 every_10m 09/01/2020 08:24:00 09/01/2020 08:23:14 46 646 600 46 7,7%
3 every_10m 09/01/2020 08:33:56 09/01/2020 08:33:14 42 596 600 -4 0,7%
4 every_10m 09/01/2020 08:45:10 09/01/2020 08:43:14 116 674 600 74 12,3%
5 every_10m 09/01/2020 08:55:55 09/01/2020 08:53:14 161 645 600 45 7,5%
6 every_10m 09/01/2020 09:06:12 09/01/2020 09:03:14 178 617 600 17 2,8%
7 every_10m 09/01/2020 09:16:07 09/01/2020 09:13:14 173 595 600 -5 0,8%
8 every_10m 09/01/2020 09:25:55 09/01/2020 09:23:14 161 588 600 -12 2,0%
9 every_10m 09/01/2020 09:37:02 09/01/2020 09:33:14 228 667 600 67 11,2%
10 every_10m 09/01/2020 09:48:40 09/01/2020 09:43:14 326 698 600 98 16,3%
11 every_10m 09/01/2020 09:57:56 09/01/2020 09:53:14 282 556 600 -44 7,3%
12 every_10m 09/01/2020 10:08:56 09/01/2020 10:03:14 342 660 600 60 10,0%
13 every_10m 09/01/2020 10:19:36 09/01/2020 10:13:14 382 640 600 40 6,7%
14 every_10m 09/01/2020 10:29:59 09/01/2020 10:23:14 405 623 600 23 3,8%
15 every_10m 09/01/2020 10:40:04 09/01/2020 10:33:14 410 605 600 5 0,8%
16 every_10m 09/01/2020 10:51:18 09/01/2020 10:43:14 484 674 600 74 12,3%
17 every_10m 09/01/2020 11:02:07 09/01/2020 10:53:14 533 649 600 49 8,2%
18 every_10m 09/01/2020 11:12:03 09/01/2020 11:03:14 529 596 600 -4 0,7%
19 every_10m 09/01/2020 11:22:03 09/01/2020 11:13:14 529 600 600 0 0,0%
20 every_10m 09/01/2020 11:33:01 09/01/2020 11:23:14 587 658 600 58 9,7%
21 every_10m 09/01/2020 11:43:01 09/01/2020 11:33:14 587 600 600 0 0,0%
22 every_10m 09/01/2020 11:54:02 09/01/2020 11:43:14 648 661 600 61 10,2%
23 every_10m 09/01/2020 12:04:08 09/01/2020 11:53:14 654 606 600 6 1,0%
24 every_10m 09/01/2020 12:15:02 09/01/2020 12:03:14 708 654 600 54 9,0%
25 every_10m 09/01/2020 12:25:01 09/01/2020 12:13:14 707 599 600 -1 0,2%

Accumulated Delay

Jitter %

Host: drone-server

  • CPU: 8 cores
  • Mem: 12GB
  • Version: from docker (1month ago) drone:latest@sha256:693188bf26a3478b5e9f16da0c1f0b1bd4bc38114c898588f5b216cac3c0a29a
  • Relevant config:
DRONE_CRON_DISABLED=false
DRONE_CRON_INTERVAL=1m

Host: drone-docker-runner

  • CPU: 16 cores
  • Mem: 32GB
  • Version: from docker (1month ago) drone-runner-docker:latest@sha256:056912298f390f7d8a65e4508b628cb3806192257215a534685692da684db65b
  • Relevant config:
DRONE_RUNNER_CAPACTIY=12

the build start time is not really the best measure here. The cron does not guarantee the build will start every 10 minutes, it will be scheduled every 10 minutes. The actual start time depends on a number of factors, including capacity and number of builds in the queue. The system also performs a number of operations before the build starts which can impact start time, such as API calls and fetching the Yaml file, which adds to latency of start time. Finally, the runners perform long polling with a backoff which can result in delays of up to 15 seconds when the runner pulls the build from the queue.

I also want to point out that accuracy of execution time was not a goal when I wrote this code. The default cron routine runs every hour which means a cron scheduled to run every 10 minutes can only run as frequently as once per hour. It also means that cron jobs scheduled to run at specific times can be off by up to 59 minutes.

And as @ashwilliams1 mentions, the start time is not necessarily the best measure. The cron routine only ensures a build is created and enqueued; the actual start time depends on runner capacity and queue wait times.

The cron routine also has blocking call which has a minor impact on interval frequency (below). I debated putting this in a goroutine but ultimately decided against this. I was concerned that someone would reduce the interval to a low level which would result in multiple concurent goroutines, which I wanted to avoid.

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(dur):
			s.run(ctx)
		}
	}

The above code could produce a few seconds of delay that accumulates over time and could be improved by using a ticker which accounts for slow receivers:

+	ticker := time.NewTicker(dur)
+	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
-		case <-time.After(dur):
+		case <-ticker.C:
			s.run(ctx)
		}
	}

Even with improvements I do not expect the cron routine to provide high levels of accuracy as this was not an inherent design goal. We wanted to provide simple scheduling without allowing the user to accidentally overload the system, which comes at the expense of accuracy. I am not saying there isn’t room for improvement. I just wanted to provide some background regarding the design decisions and tradeoffs made.

Hi @bradrydzewski and @ashwilliams1 thank you for your quick reply.

I would like to discuss ( possibly easy and quick changes) that could improve the cron job support on drone.

We have dozens of cron jobs all configured as “@every Xm”. Following with the use case, we have detected 3 points to review.

1 - Jitter

As yourself commented before, it could be improved with a few changes in code, and if you want we can test it for you

Meanwhile, we are going to set DRONE_CRON_INTERVAL=1s and ensure that server has enough resources on drone-server and drone-runner

We can discuss more on this point but I prefer to prioritize the two next points, as it is directly a critical point in our environment

2 - Accumulated delay

We think it is only happening in jobs with an @every XX expression (our use case), cron with expression like * * * * * may not be affected.

We are reviewing the code and it seems that the next execution time is based of now (from cron interval) instead of last execution time. The next proposal would change this behaviour and retrive the next interval time based on the previous time

We expect avoid accumulated delays from the internal scheduler

# Retrieve next time based on previous time
Next := sched.Next(time.Unix(job.Prev, 0))
if Next.Before(now) {
#If past in time, retreive the next valid time and log skipped job
   while Next.Before(now)  {
      logger.Warnf("skip cron job scheduled in the past at : [%s]",Next)
      Next = sched.Next(Next)
   }
}
job.Prev = job.Next
job.Next = Next.Unix()

Furthermore we would like to point on the query to retrieve ready jobs to be scheduled. It seems that it is skpping in case that the job.Next time is equal than now() (from cron interval), and it will be delayed to the next execution.

We think equal case should be included with an <= as conditional.

Finally, as you said, the proposed change with ticker would improve accuracy of the cron interval execution

3 - Cron jobs alignement

In the same way as you are computing next scheduled time ( from now() ) , you are also aligning all cron tasks based on @every XXX expression it they have all the same period, this is a problem when you have dozens of programmed jobs that need high ‘cpu’ and ‘memory’ resources they will be triggered always at the same time

If we schedule 10 pipelines @every 10m and each pipeline execution needs 1G ram and 1 CPU for 1 minute, when programmed and aligned our drone-runner will execute cron jobs that will need for 10 GB and 10 cpu all once at the first minute and the following 9 minutes drone-runner won’t need any resources. With this behaviour, the resource consumption becomes inneficient and could overload the system

With proposed solution in 2 you can also maintain the random execution distribution on time and spread resources along time. This time would be randomly chosen by the user when it has deploy the pipeline and enabled cron

What to you think about the possibility to implement solution described on 2, to avoid cron jobs alignmenet and accumulated delay on job scheduling

Thanks,
Regards!