Failed servers not being killed

I’m having an issue with servers randomly unavailable, which is fine but drone is not killing them off. It could be like the following where it’s failed to ping server, but there’s other errors from time to time, happens pretty frequently though. I’ve tried adjusting reaper settings to no avail.

Name: agent-1EL8Ya70
Address: x.x.x.x
Region:  us-east-1b
Size:    m5.large
State:   error
Error:   Failed to ping the server

Here’s the settings for the autoscaler

|DRONE_AMAZON_INSTANCE|m5.large|
|DRONE_AMAZON_MARKET_TYPE|spot|
|DRONE_AMAZON_PRIVATE_IP|true|
|DRONE_AMAZON_REGION|xxxx|
|DRONE_AMAZON_SECURITY_GROUP|xxxx|
|DRONE_AMAZON_SSHKEY|xxxx|
|DRONE_AMAZON_SUBNET_ID|xxxx|
|DRONE_AMAZON_VOLUME_IOPS|1000|
|DRONE_DATABASE_DATASOURCE|xxxx|
|DRONE_DATABASE_DRIVER|postgres|
|DRONE_ENABLE_PINGER|true|
|DRONE_LOGS_TRACE|false|
|DRONE_POOL_MAX|8|
|DRONE_POOL_MIN|1|
|DRONE_POOL_MIN_AGE|5m|
|DRONE_REAPER_ENABLED|true|
|DRONE_REAPER_INTERVAL|1m|
|DRONE_SERVER_HOST|xxx.xxx.xxx|
|DRONE_SERVER_PROTO|https|
|DRONE_SERVER_TOKEN|xxxx|

we would need to see the full trace logs to advise further. the trace logs are very detailed and will tell us exactly what is happening under the hood.

How would I enable trace logs? I tried adding DRONE_LOGS_TRACE to the autoscaler and it didn’t seem to make a change. I did however noticed the following error.

 {
     "id": "iReZxcbuQyE6ArF9",
     "level": "debug",
     "min-pool": 1,
     "msg": "abort terminating %!d(MISSING) instances to ensure minimum capacity met",
     "servers-running": 2,
     "servers-to-terminate": 2,
     "time": "2020-12-15T20:41:15Z"
 }

you can enable the below settings and then provide the full logs

DRONE_LOGS_DEBUG=true
DRONE_LOGS_TRACE=true

I did however noticed the following error.

Looks like a debug log entry was missing a parameter in the format string. I just patched the code and fixed the format string so that the log entry would be output properly. This is just a formatting issue and this log entry is not relevant to this discussion (but I do appreciate you pointing this out so we could fix)

-Debugf("abort terminating %d instances to ensure minimum capacity met")
+Debugf("abort terminating instances to ensure minimum capacity met")

Glad I can accidentally help. :slight_smile:

I do have both debug and trace enabled, but I don’t see any additional trace logs. I’m assuming it would be in stdout as well? Also I found this interesting.

{
    "id": "vRRwouVKUwTpKncn",
    "level": "debug",
    "max-pool": 8,
    "min-pool": 1,
    "msg": "check capacity",
    "pending-builds": 0,
    "running-builds": 0,
    "server-buffer": 0,
    "server-capacity": 6,
    "server-count": 3,
    "time": "2020-12-15T22:41:17Z"
}

But if you look at the info for the servers, 2 are clearly in error states, but they’re counted towards the server capacity.

Name: agent-aMxzIhB6
Address: x.x.x.29
Region:  us-east-1b
Size:    m5.large
State:   error
Error:   Failed to ping the server
Name: agent-KMWQYahH
Address: 
Region:  
Size:    
State:   error
Error:   InsufficientInstanceCapacity: There is no Spot capacity available that matches your request.
	status code: 500, request id: 705eed3f-9a7f-4a18-9305-57aaa62905af
Name: agent-oLV5EKVY
Address: x.x.x.91
Region:  us-east-1b
Size:    m5.large
State:   running

I could be wrong here but taking into account the error from before

{
    "id": "iK4EWphxxGHETGd4",
    "level": "debug",
    "min-pool": 1,
    "msg": "abort terminating %!d(MISSING) instances to ensure minimum capacity met",
    "servers-running": 1,
    "servers-to-terminate": 2,
    "time": "2020-12-15T22:51:17Z"
}

This leads me to believe that it does in fact want to kill those 2 servers but is somehow getting the math wrong. There’s 3 servers, 2 in error state, 1 running as expected. This process is only taking into account the servers that are running properly. I’d suspect and may test dropping the minimum to 0 which would make this error go away and potentially kill those hosts. But that’s really only treating the symptom not the problem.

I can provide some additional color here … These logs come from the planner routine which intentionally disqualifies errored instances for removal for reasons described here and here. This means there is only 1 healthy server that can be terminated, however, terminating this instance would cause the healthy server count to fall below the minimum, hence the log entry about aborting termination to ensure minimum capacity.

Instead, error instances are removed by the reaper routine. This is a special, opt-in routine, that periodically queries for errored instances and attempts to remove them. If the instance was never created in the first place, it can detect this and update the instance state accordingly.

It would be helpful to provide us with the full logs so we can sort through and analyze. Note that we are not looking for just error logs. The surrounding debug and info logs are just as important so that we can get a better understanding of what is happening, and then advise you accordingly. If the logs are too large to paste into discourse, you can post to a gist or pastebin and provide us the link.

As an example of what we are looking for, there are logs in the reaper that tell is what happens when it comes across an instance with an error state, and whether or not it decides to terminate, or if and why it fails to terminate:

There are also logs in the collector that tells us what happens when it attempts to terminate the instance, and if and why that may have failed:

There is no Spot capacity available that matches your request.
status code: 500

I just wanted to provide a small warning that the autoscaler does not fully support spot instances and you may encounter issues, such as those described here. With that being said, full support for spot instances is on our roadmap and something we hope to tackle in 2021.

I see. That makes sense then if it’s the planner routine running. But it’s interesting that the reaper doesn’t seem to be running. I do have the DRONE_REAPER_ENABLED variable set to true. I also appreciate you working with me to go through this, in the end it’s going to be something simple. I am running this in ECS so I apologize for the formatting of the logs, you can thank cloudwatch for that.

LOGS: https://pastebin.com/RtsVEmZB

It’s interesting as there’s no mention of the reaper.

Bingo! Sorry for not catching this sooner, but the reaper is currently enabled by setting an inline feature flag named DRONE_ENABLE_REAPER

We have a placeholder in our config struct for an official DRONE_REAPER_ENABLED setting, but the code still relies on the DRONE_ENABLE_REAPER feature flag. The intention was to move away from a feature flag once we were ready to graduate this experimental feature to stable, but this never really happened. We will look into graduating this feature, but in the meantime, you can use DRONE_ENABLE_REAPER.

Hopefully this solves the problem for you :slight_smile:

Ah! I knew it was going to be something silly! Always is. Updated the variable and wham, the reaper is out there reaping! Thanks dude!

1 Like