(Self-managed edition) -- Flushing the Redis cache after recovering from a failure that leaves the cache stale or corrupted

Howdy,

Every now and then, something unexpectedly bad happens that everyone put so much effort into preventing, such as a complete SAN failure or another outage (power, etc.), and this can sometimes leave Harness in a bad state, specifically the Redis cache, and unfortunately it is hard to tell from examining the logs alone. The number one symptom I have noticed is that accounts which normally have no trouble logging in will log in and load the dashboard, then the JWT (bearer) token expires and dumps you back to the login screen, time and time again.

Since it is such a harmless operation, there is no sense in going into the details of which log messages (if any) to look for in harness-gateway or harness-manager. If you are having trouble logging in, or otherwise suspect a stale cache, here is what I recommend doing before attempting to log in again:

  1. Determine the primary Redis Sentinel server. By default this is usually going to be redis-sentinel-harness-server-0 of the 3 pods, with 1 and 2 being secondaries. You can confirm this by executing the following via kubectl:
kubectl exec -n harness -it redis-sentinel-harness-server-0 -- redis-cli INFO | grep role

If it is not the master in the cluster, it will reply with:

role:slave

When a pod replies with role:master instead, you have found your master; move on to step 2. (A quick loop to check all three pods at once is sketched below.)
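If you would rather not check the pods one at a time, a small loop like the following prints the role of each of the three pods in one pass. This is a minimal sketch, assuming the default pod names and the harness namespace used above:

for i in 0 1 2; do
  echo -n "redis-sentinel-harness-server-$i: "
  kubectl exec -n harness redis-sentinel-harness-server-$i -- redis-cli INFO replication | grep '^role'
done

Whichever pod reports role:master is the one to flush in step 2.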

  2. Now that we have our master determined, a simple kubectl command against that container (using 0 as the master in this case), running redis-cli to flush the cache as follows, should do the trick (a quick way to verify the flush is noted after the command):
kubectl exec -n harness -it redis-sentinel-harness-server-0 -- redis-cli FLUSHDB
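If you want a sanity check that the flush actually happened, DBSIZE reports the number of keys in the current database and should come back as 0 immediately after the FLUSHDB (same pod name and namespace assumptions as above):

kubectl exec -n harness -it redis-sentinel-harness-server-0 -- redis-cli DBSIZE

Also worth noting: FLUSHDB only clears the currently selected database (DB 0 by default), while FLUSHALL wipes every database on the instance; the single-database flush shown above is the lighter touch.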
  3. The third and final step, to rebuild the cache (note: the system will be a bit slower than usual at first), is to simply bounce the manager pods, or scale the replicas to 0 and then back up to the normal number, e.g.:
kubectl -n harness scale deployment harness-manager --replicas=0

Once all of the manager pods are down, do the reverse:

kubectl -n harness scale deployment harness-manager --replicas=2

After this the manager(s) will reconnect and rebuild the Redis cache, and you should be in good shape moving forward.
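As a side note, on a reasonably recent kubectl a rolling restart is a one-command alternative to the scale down/up, and rollout status is a handy way to confirm the managers are back before you try logging in again. Keep in mind a rolling restart replaces pods one at a time rather than taking them all down together like the scale-to-zero approach above:

kubectl -n harness rollout restart deployment/harness-manager
kubectl -n harness rollout status deployment/harness-manager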

One thing to note: harness-gateway also relies on Redis to some extent, so restarting those pod(s) can also be a good idea if bouncing the manager alone does not do the trick.
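If you do end up bouncing the gateway too, the same approach applies. A sketch, assuming the gateway deployment in your install is named harness-gateway (kubectl -n harness get deployments will show the exact name):

kubectl -n harness rollout restart deployment/harness-gateway
kubectl -n harness rollout status deployment/harness-gateway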

Until next time.
