We would like to verify whether Drone supports multi-node testing scenarios for Deep Learning. Our case is that we need to test communication in a small cluster with at least two nodes (instances): we start a pytest suite, and the tested function manages/sends/fetches jobs across the cluster.
Is this possible, and would it even be possible in an auto-scaling manner, meaning that Drone would create/manage these clusters in AWS/GCP according to the actual job queue?
hey there, let’s dig a bit deeper and see if I can help.
Drone does support fanning out across multiple nodes, which would allow you to parallelize your tests across nodes. In the below example, we define four pipelines that execute as a directed graph. Each pipeline can be scheduled on a separate node.
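A minimal sketch of what such a fan-out could look like, assuming Docker pipelines. The pipeline names, images, and commands are hypothetical placeholders; only the `depends_on` graph shape is the point here.

```yaml
# Four pipelines forming a directed graph: build -> (test-a, test-b) -> report.
# Names, images, and commands are placeholders, not taken from a real project.
kind: pipeline
type: docker
name: build

steps:
- name: build
  image: golang
  commands:
  - go build ./...

---
kind: pipeline
type: docker
name: test-a

steps:
- name: test
  image: golang
  commands:
  - go test -v ./a/...

# Starts only after the build pipeline has finished.
depends_on:
- build

---
kind: pipeline
type: docker
name: test-b

steps:
- name: test
  image: golang
  commands:
  - go test -v ./b/...

depends_on:
- build

---
kind: pipeline
type: docker
name: report

steps:
- name: notify
  image: alpine
  commands:
  - echo "all tests finished"

# Waits for both test pipelines, which may have run on different nodes.
depends_on:
- test-a
- test-b
```

Since `test-a` and `test-b` depend only on `build`, they can be picked up by different runners at the same time.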
Hi, thanks for such a quick reply. Let me give you some more background on our use case. For our testing, we need instances with two GPUs (on each machine). About 40% of the day the CI is idle, and for about 40% of the time we have a queue that would require parallelization by a factor of 2 or 3. So our ideal scenario is autoscaling: run the base Drone CI on a very simple machine and, when test jobs arrive, start another instance with GPUs.
Coming to the multi-node case: let's say that our tests need 2 nodes and, for simplicity, call them A and B.
The model case is running Drone CI on one of these nodes, A or B, or eventually on a node 0 that would have only limited resources. When a test job comes, it runs on both nodes A and B, so these nodes must not be used for any other job until the testing finishes. Eventually, if the test job queue exceeds some limit, we want to start another two nodes, C and D, and run some portion of the queued jobs there.
So basically we need to set somewhere that each test job needs/takes two instances…
@Borda thanks for the explanation. I have a few follow-up questions.
are you using this approach for all repositories, or only a subset?
which cloud provider are you using?
how would you feel about running these tests directly on virtual machines (instead of inside containers)? we have a special type of pipeline runner that can spawn virtual machines for each build. We currently have an implementation for Digital Ocean virtual machines, but we plan on expanding to support other cloud providers. Would you mind taking a look and letting us know if this would work for your use case, assuming we could support your cloud provider? See https://docs.drone.io/runner/digitalocean/overview/ and https://docs.drone.io/pipeline/digitalocean/overview/
we are using it now for one repo, but we want to extend it to two more of our repos
at this moment we use GCP, but in a few weeks we are considering moving to our in-house cloud
In the past, we have experienced some hanging/freezing tests when turning on Drone CI parallelization (we have not investigated the details, but we think there were some locks on the GPUs that were not released after particular tests finished), so could it be because we are using containers instead of a virtual machine?
(How) would this virtual machine handle two instances? By my understanding, a virtual machine is just one instance, right?
just for the record, the actual running Drone is here
as an example, our Digital Ocean runner spawns a single-use virtual machine for each pipeline. So if you wanted to split your tests across two machines, you would define two pipelines, which would spawn two virtual machines. Here is a quick example:
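# first pipeline (spawns its own virtual machine)
- name: test
  commands:
  - go test -v bar/...
# second pipeline (spawns a second virtual machine)
- name: test
  commands:
  - go test -v bar/...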
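For reference, a fuller sketch of how two such pipelines might be laid out as separate YAML documents. The pipeline names, the secret name, and the `server` values are assumptions written from memory of the runner docs linked earlier in the thread, so please check them against those docs before use.

```yaml
# Sketch only: pipeline names, secret name, and server settings are assumed placeholders.
kind: pipeline
type: digitalocean
name: first

token:
  from_secret: token   # hypothetical secret holding the Digital Ocean API token

server:
  image: docker-18-04  # placeholder droplet image
  size: s-1vcpu-1gb    # placeholder droplet size
  region: nyc1         # placeholder region

steps:
- name: test
  commands:
  - go test -v bar/...

---
kind: pipeline
type: digitalocean
name: second

token:
  from_secret: token

server:
  image: docker-18-04
  size: s-1vcpu-1gb
  region: nyc1

steps:
- name: test
  commands:
  - go test -v bar/...
```

Each document is scheduled independently, so the two `test` steps run on two different single-use virtual machines.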
We do not currently have a runner that supports spawning per-pipeline virtual machines on other providers (AWS or GCP), but this is something we could invest time in if there was interest. We have an AWS implementation that is close to being complete at github.com/drone-runners/drone-runner-aws
@bradrydzewski Splitting tests over multiple machines, aka parallel testing, is a different case. We need to run all tests on a cloud where each test requires the presence of two machines/instances…
So, say we have tests T1 and T2 and a cloud containing machines A and B. For both tests T1 and T2, we need to have A and B available, because inter-node/machine communication and sync are part of the test.
Alternatively, you could simulate inter-node communication in Drone by launching service containers. Each service container would look and feel like a separate node, attached to a separate network, and could be used to test network communications.
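A rough sketch of that idea, assuming a Docker pipeline. The service name `node-b`, the image, the port, and the commands are hypothetical stand-ins for a real worker process and test suite.

```yaml
# Sketch only: the service stands in for a second "node"; all names are placeholders.
kind: pipeline
type: docker
name: multi-node-sim

steps:
- name: test
  image: python:3.8
  commands:
  # Give the service a moment to start listening before contacting it.
  - sleep 5
  # The service container is reachable by its name as a hostname.
  - python -c "import urllib.request; print(urllib.request.urlopen('http://node-b:8000').status)"

services:
- name: node-b
  image: python:3.8
  # Stand-in for a worker process: a plain HTTP server from the standard library.
  command: ["python", "-m", "http.server", "8000"]
```

In a real setup, the `test` step would run the pytest suite and the service would run the second node's worker, with the hostname `node-b` passed to the tests.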