How I Tricked ArgoCD Into Sharding on a Single Cluster

The Problem That Bugged Me

So I was working on a scaling problem for a fintech startup. We run ArgoCD to manage our Kubernetes deployments: a pretty standard GitOps setup, nothing fancy. We had 148 applications running on a single EKS cluster, managed by ArgoCD in HA mode with 2 application controller replicas.

Everything looked fine on paper. Two controller pods, workload should be split, right?

Wrong.

One fine day I was debugging some sync delays and ran kubectl top pods:

NAME                                  CPU(cores)   MEMORY(bytes)
argocd-application-controller-0       846m         449Mi
argocd-application-controller-1       6m           46Mi

Controller-0 was sweating at 846m CPU. Controller-1 was literally chilling at 6m - doing absolutely nothing 🙂 All 148 apps were being reconciled by a single pod.

I checked the logs:

time="..." level=info msg="Cluster https://kubernetes.default.svc has been assigned to shard 0"

Shard 0. Everything on shard 0. Controller-1 had zero clusters assigned.

As a temporary fix, I bumped the CPU limits for controller-0 - but that's just a bandaid 🩹, not a solution. You're throwing more resources at a pod that shouldn't be doing all the work alone.

Why This Happens

Here's the thing most people (including me) don't realise at first:

ArgoCD shards by cluster, not by application.

If you have 10 clusters and 2 controller replicas, each controller gets ~5 clusters. Beautiful.

But if you have 1 cluster and 2 controller replicas? That single cluster goes to one shard, and the other controller sits idle. Doesn't matter if you have 10 apps or 10,000 apps - all of them go through one controller because they all point to the same cluster.

Adding more replicas? Useless. I tried. Went from 2 to 3 replicas. The third pod also sat idle. It's not a scaling problem - it's an architectural one.

Asking Around

I posted on the ArgoCD Slack channel, explained the situation. The general consensus was:

"Your only recourse would be to scale up"

Fair enough. But we're in a banking environment - we don't spin up clusters for fun. We have one non-prod cluster, one prod cluster, and that's it. Regulatory compliance, cost, operational overhead - all the usual reasons.

So I went digging.

The ExternalName Trick!

After a lot of reading through ArgoCD source code, GitHub issues, and blog posts, I stumbled upon something interesting.

What if we could fake multiple clusters?

In Kubernetes, there's a service type called ExternalName. It's basically a DNS alias (a CNAME record) - it resolves to whatever hostname you point it at. So if I create an ExternalName service that points back to kubernetes.default.svc.cluster.local (which is just the local API server), ArgoCD sees a different server URL and treats it as a "different" cluster.

Same API server. Same cluster. But ArgoCD thinks it's managing two separate clusters.

Here's what I did, step by step.

Step 1: Create Fake Clusters (ExternalName Services)

apiVersion: v1
kind: Service
metadata:
  name: argocd-cluster-00
  namespace: argocd-system
spec:
  type: ExternalName
  externalName: kubernetes.default.svc.cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: argocd-cluster-01
  namespace: argocd-system
spec:
  type: ExternalName
  externalName: kubernetes.default.svc.cluster.local

Both services point to the same place. But they have different DNS names:

  • argocd-cluster-00.argocd-system.svc.cluster.local

  • argocd-cluster-01.argocd-system.svc.cluster.local
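If you want more than two fake clusters, writing the manifests by hand gets tedious. Here's a minimal sketch of a generator, assuming the same argocd-cluster-NN naming and argocd-system namespace as above. The function name gen_fake_cluster_services is made up for this example.

```shell
# Emit one ExternalName Service manifest per fake cluster.
# Assumption: names follow the argocd-cluster-NN pattern used in this post;
# gen_fake_cluster_services is a hypothetical helper, not an ArgoCD tool.
gen_fake_cluster_services() {
  count=$1
  namespace=${2:-argocd-system}
  i=0
  while [ "$i" -lt "$count" ]; do
    name=$(printf 'argocd-cluster-%02d' "$i")
    # Separate documents with '---' so the output is a multi-document stream.
    if [ "$i" -gt 0 ]; then
      echo '---'
    fi
    cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  type: ExternalName
  externalName: kubernetes.default.svc.cluster.local
EOF
    i=$((i + 1))
  done
}

# Usage (needs cluster access, not run here):
#   gen_fake_cluster_services 2 | kubectl apply -f -
```

Piping the output through kubectl apply should give the same result as the hand-written YAML in Step 1.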

Step 2: Create a Non-Expiring Service Account Token

ArgoCD needs a bearer token to talk to "external" clusters (anything that's not the built-in https://kubernetes.default.svc). Since these fake clusters are still the same API server, we can reuse the controller's own service account:

apiVersion: v1
kind: Secret
metadata:
  name: argocd-controller-token
  namespace: argocd-system
  annotations:
    kubernetes.io/service-account.name: argocd-application-controller
type: kubernetes.io/service-account-token

Why not kubectl create token? I tried that first, but the cluster had a 24-hour max token duration. In a banking setup, you don't want your sharding to break every morning because a token expired. The old-style SA secret gives you a non-expiring token.
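Once the secret exists, Kubernetes fills in its token field, base64-encoded like all Secret data. A rough sketch of pulling it out, assuming GNU-style base64 -d is available. The helper names decode_token and fetch_sa_token are hypothetical.

```shell
# Decode a base64-encoded value read from stdin.
# Secret data is stored base64-encoded, so the raw jsonpath output
# cannot be pasted into the cluster secret as-is.
decode_token() {
  base64 -d
}

# Fetch and decode the token from the secret created above.
# Hypothetical wrapper; it needs cluster access and is only defined here, not called.
fetch_sa_token() {
  kubectl get secret argocd-controller-token -n argocd-system \
    -o jsonpath='{.data.token}' | decode_token
}
```

The decoded value is what goes into the bearerToken field of the cluster secrets in the next step.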

Step 3: Register Fake Clusters in ArgoCD

ArgoCD discovers clusters through Kubernetes Secrets with a specific label. I created one secret per fake cluster:

apiVersion: v1
kind: Secret
metadata:
  name: argocd-cluster-00-secret
  namespace: argocd-system
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: argocd-cluster-00
  server: https://argocd-cluster-00.argocd-system.svc.cluster.local
  shard: "0"
  config: |
    {
      "bearerToken": "<YOUR_SA_TOKEN>",
      "tlsClientConfig": {
        "insecure": true
      }
    }
---
apiVersion: v1
kind: Secret
metadata:
  name: argocd-cluster-01-secret
  namespace: argocd-system
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: argocd-cluster-01
  server: https://argocd-cluster-01.argocd-system.svc.cluster.local
  shard: "1"
  config: |
    {
      "bearerToken": "<YOUR_SA_TOKEN>",
      "tlsClientConfig": {
        "insecure": true
      }
    }

Notice the shard: "0" and shard: "1" fields. This is manual shard assignment - more on why this matters below.
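The two secrets differ only in the index, so they can be templated. This is a sketch under the same naming assumptions as the YAML above. The helper gen_cluster_secret is hypothetical and takes the fake-cluster index and an already-decoded token.

```shell
# Emit a cluster Secret for one fake cluster.
# Assumption: the cluster index doubles as the shard number, as in the
# manifests above. gen_cluster_secret is a hypothetical helper.
gen_cluster_secret() {
  index=$1   # fake-cluster index, also used as the shard number
  token=$2   # decoded service account token
  name=$(printf 'argocd-cluster-%02d' "$index")
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: ${name}-secret
  namespace: argocd-system
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: ${name}
  server: https://${name}.argocd-system.svc.cluster.local
  shard: "${index}"
  config: |
    {
      "bearerToken": "${token}",
      "tlsClientConfig": {
        "insecure": true
      }
    }
EOF
}

# Usage (needs cluster access, not run here):
#   gen_cluster_secret 1 "$TOKEN" | kubectl apply -f -
```

Keeping the index and the shard number tied together means fake cluster N always lands on controller N, which only makes sense while the number of fake clusters equals the number of replicas.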

Step 4: Why Manual Shard Assignment Was Necessary

After creating the cluster secrets, I checked the controller logs expecting to see a nice even split. Instead:

level=info msg="Cluster https://argocd-cluster-00... has been assigned to shard 0"
level=info msg="Cluster https://argocd-cluster-01... has been assigned to shard 0"
level=info msg="Cluster https://kubernetes.default.svc has been assigned to shard 0"

Everything on shard 0 again! 😤

Turns out, ArgoCD's default sharding algorithm is called legacy. It uses hash(cluster_id) % total_shards, and nothing in that formula guarantees an even spread: with just 2-3 clusters, the hashes can easily all land on the same shard.
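To get a feel for why a hash-modulo scheme doesn't balance a handful of clusters, here's a toy version. Big caveat: it uses cksum as a stand-in hash purely because it's available everywhere. It is not the hash ArgoCD uses, so the numbers it prints won't match real controller logs.

```shell
# Toy stand-in for "hash(cluster_id) % total_shards".
# Assumption: cksum (a CRC) replaces ArgoCD's real hash function, so this
# only illustrates the shape of the algorithm, not its actual output.
toy_shard() {
  cluster_id=$1
  replicas=$2
  hash=$(printf '%s' "$cluster_id" | cksum | cut -d' ' -f1)
  echo $((hash % replicas))
}

# Usage:
#   for c in https://kubernetes.default.svc \
#            https://argocd-cluster-00.argocd-system.svc.cluster.local \
#            https://argocd-cluster-01.argocd-system.svc.cluster.local; do
#     echo "$c -> shard $(toy_shard "$c" 2)"
#   done
```

Each ID is hashed independently, so three IDs mapping to the same remainder is a perfectly normal outcome. The spread only evens out once there are many more clusters than shards.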

The fix is simple - set the shard field explicitly in the cluster secret. ArgoCD respects this and skips the hash function entirely. After adding the shard fields:

level=info msg="Cluster https://argocd-cluster-01... has changed shard from 0 to 1"

Now controller-0 manages cluster-00, controller-1 manages cluster-01.
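Scrolling through logs by eye gets old fast. A small filter like the one below turns those messages into "shard cluster" pairs. It assumes the message wording matches the log lines quoted in this post, and shard_assignments is a made-up name.

```shell
# Read controller log lines on stdin and print "<shard> <cluster>" pairs.
# Assumption: the message format matches the log lines quoted in this post.
shard_assignments() {
  sed -n \
    -e 's/.*msg="Cluster \([^ ]*\) has been assigned to shard \([0-9][0-9]*\)".*/\2 \1/p' \
    -e 's/.*msg="Cluster \([^ ]*\) has changed shard from [0-9][0-9]* to \([0-9][0-9]*\)".*/\2 \1/p'
}

# Usage (needs cluster access, not run here):
#   kubectl logs argocd-application-controller-0 -n argocd-system | shard_assignments
```

Later lines for the same cluster override earlier ones, so read the output top to bottom.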

Step 5: Update AppProjects

This one almost bit me. ArgoCD Projects restrict which destination servers apps can deploy to. If your project only allows https://kubernetes.default.svc, then apps pointing to the new ExternalName URLs will be rejected.

I had to add the new cluster URLs to every project:

kubectl patch appproject my-project -n argocd-system --type='json' -p='[
  {"op":"add","path":"/spec/destinations/-","value":{"server":"https://argocd-cluster-00.argocd-system.svc.cluster.local","namespace":"*-service"}},
  {"op":"add","path":"/spec/destinations/-","value":{"server":"https://argocd-cluster-01.argocd-system.svc.cluster.local","namespace":"*-service"}}
]'

Don't skip this step. I've seen people get stuck here wondering why their apps won't sync after migration.
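With more than a couple of projects, building the JSON patch by hand is error-prone. Here's a sketch of a helper that builds it from a namespace pattern and a list of server URLs. build_destination_patch is a hypothetical name, and the "*-service" pattern is just what my projects use.

```shell
# Print a JSON patch that appends one destination per server URL.
# Assumption: every project already has a spec.destinations list, which
# the "/spec/destinations/-" path requires.
build_destination_patch() {
  ns_pattern=$1
  shift
  sep=''
  printf '['
  for server in "$@"; do
    printf '%s{"op":"add","path":"/spec/destinations/-","value":{"server":"%s","namespace":"%s"}}' \
      "$sep" "$server" "$ns_pattern"
    sep=','
  done
  printf ']\n'
}

# Usage (needs cluster access, not run here):
#   patch=$(build_destination_patch '*-service' \
#     https://argocd-cluster-00.argocd-system.svc.cluster.local \
#     https://argocd-cluster-01.argocd-system.svc.cluster.local)
#   for p in $(kubectl get appprojects -n argocd-system -o name); do
#     kubectl patch "$p" -n argocd-system --type=json -p "$patch"
#   done
```

Running the loop twice would add duplicate destinations, so check a project's current destinations before re-applying.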

Step 6: Migrate Apps

Split 148 apps - 74 to cluster-00, 74 to cluster-01:

kubectl patch application <app-name> -n argocd-system --type='json' \
  -p='[{"op":"replace","path":"/spec/destination/server","value":"https://argocd-cluster-00.argocd-system.svc.cluster.local"}]'

I wrote a simple bash script that read app names from a file and patched them in bulk.
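This isn't the exact script I ran, but a sketch of the same idea. It splits the work into a planning step you can eyeball and an apply step that does the patching. The alternating assignment and the function names plan_migration and apply_migration are assumptions for this example.

```shell
# Read app names (one per line) on stdin and print "<app> <server-url>",
# alternating between fake clusters so the split stays even.
# Assumption: fake clusters are named argocd-cluster-00, -01, ... as above.
plan_migration() {
  clusters=$1   # number of fake clusters, e.g. 2
  awk -v n="$clusters" 'NF {
    printf "%s https://argocd-cluster-%02d.argocd-system.svc.cluster.local\n", $1, (i++ % n)
  }'
}

# Apply a plan produced by plan_migration.
# Needs cluster access; only defined here, not called.
apply_migration() {
  while read -r app server; do
    kubectl patch application "$app" -n argocd-system --type=json \
      -p "[{\"op\":\"replace\",\"path\":\"/spec/destination/server\",\"value\":\"$server\"}]"
  done
}

# Usage (not run here):
#   plan_migration 2 < apps.txt > plan.txt    # review plan.txt first
#   apply_migration < plan.txt
```

Writing the plan to a file first gives you something to review and a record of which app went where.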

The ApplicationSet Gotcha

After running the migration script, I checked the distribution:

  74 https://argocd-cluster-00.argocd-system.svc.cluster.local
  52 https://argocd-cluster-01.argocd-system.svc.cluster.local
  22 https://kubernetes.default.svc
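A count like the one above can come from something as simple as the following sketch: dump every app's destination server and tally the values. count_destinations is a made-up helper name.

```shell
# Tally identical lines from stdin, most frequent first.
count_destinations() {
  sort | uniq -c | sort -rn
}

# Usage (needs cluster access, not run here):
#   kubectl get applications -n argocd-system \
#     -o jsonpath='{range .items[*]}{.spec.destination.server}{"\n"}{end}' \
#     | count_destinations
```

Anything still listed under https://kubernetes.default.svc after a migration is an app that did not move.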

22 apps refused to move! Turns out they were managed by ApplicationSets. When you patch an Application that's owned by an ApplicationSet, the ApplicationSet controller overwrites your change back to the original destination.

The fix: patch the ApplicationSet's template instead.

kubectl patch applicationset perf-testing-stack -n argocd-system --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/destination/server","value":"https://argocd-cluster-01.argocd-system.svc.cluster.local"}]'

After that, clean 74/74 split.

The Result

Before:

argocd-application-controller-0   846m    449Mi
argocd-application-controller-1   6m      46Mi

After:

argocd-application-controller-0   72m     1723Mi
argocd-application-controller-1   10m     774Mi    (ramping up)

Controller-0 dropped from 846m to 72m CPU. Controller-1 went from idle to actively processing its 74 apps. Over the next hour, both controllers converged to roughly equal resource usage.

All 134 apps that were healthy before the migration stayed healthy. Zero disruption. Zero downtime.

Things to Be Careful About

Let me be very clear - this is a hack, not an official feature. It works, it's stable, but you should be aware of:

  1. API server load: Both controllers now independently hit the same API server. Monitor your kube-apiserver metrics. In my case, the combined CPU/memory of both controllers was roughly the same as the single overloaded controller before, so API load didn't increase noticeably.

  2. Token management: The SA token in the cluster secrets needs to stay valid. If someone deletes the service account or the token secret, both fake clusters lose access.

  3. Terraform / GitOps drift: If your ArgoCD setup is managed by Terraform or Helm, make sure you codify all of this. I initially did everything manually, then had to go back and add everything to Terraform: ExternalName services, cluster secrets, AppProject changes, the lot. Don't skip this.

  4. ApplicationSets: As I learned the hard way, patching individual apps doesn't work if they're owned by ApplicationSets. Always check ownerReferences before migrating.

  5. Sync status after migration: Apps will briefly show OutOfSync after changing the destination server. This is normal: ArgoCD re-compares the state against the "new" cluster. If auto-sync is enabled, it resolves automatically. If not, a manual sync fixes it.
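For point 4, here's a sketch of how to list the apps owned by an ApplicationSet before touching anything. It filters lines of the form "app-name owner-kind ...". appset_owned is a hypothetical helper name.

```shell
# Input: "<app-name> <owner-kind> ..." per line (owner kinds may be absent).
# Output: names of apps that have an ApplicationSet among their owners.
appset_owned() {
  awk '{
    for (i = 2; i <= NF; i++) {
      if ($i == "ApplicationSet") { print $1; break }
    }
  }'
}

# Usage (needs cluster access, not run here):
#   kubectl get applications -n argocd-system \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.ownerReferences[*].kind}{"\n"}{end}' \
#     | appset_owned
```

Apps on that list need their ApplicationSet template patched, not the Application itself.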

When Should You Use This?

  • You have a single cluster with many apps and the ArgoCD application controller is the bottleneck

  • You've already tried tuning controller flags (--status-processors, --operation-processors) and it's not enough

  • You can't add more clusters (cost, compliance, operational reasons)

  • You need a quick win while waiting for ArgoCD's official dynamic sharding feature to mature (it's still alpha as of v2.13)

When Should You NOT Use This?

  • If you have multiple clusters already - just use normal sharding

  • If your bottleneck is the repo-server or Redis, not the controller

  • If you're not comfortable maintaining a non-standard setup

Final Thoughts

This whole exercise, from discovering the problem to having a working solution, took me about 2 days. The trick itself is simple, but the devil is in the details: manual shard assignment, AppProject updates, ApplicationSet awareness, and codifying everything in IaC.

It's running on our non-prod cluster right now. If things stay stable over the next week, we'll roll it out to production.

Sometimes the best solutions aren't in the docs; they come from understanding how the system works under the hood and bending it just enough to solve your problem.


If you found this useful, feel free to share it with your team. And if you've found a better way to handle single-cluster ArgoCD scaling, I'd love to hear about it. Drop a comment below!