The GitLab CI Playbook: Running 75,000 Jobs a Day

A team at a large enterprise runs 75,000 GitLab CI jobs every day. Their setup is almost boring.

EKS for runner management. Official Helm chart. One runner namespace per team. Fleeting plugin for autoscaling on spot instances.

“We don’t think about it,” they said. “It just works.”

That’s the goal. Getting there requires some deliberate architectural decisions. Here’s what teams running GitLab CI at scale have figured out.

The Hierarchy That Makes Everything Else Work

GitLab CI isn’t just about pipelines. It’s about the organizational structure those pipelines inherit from.

GitLab’s hierarchy — Groups, Subgroups, Projects — determines how settings cascade. A CI template defined at the group level is available to every project underneath it. A label created at the group level propagates everywhere. A runner registered at the group level executes jobs for every project.
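As a sketch of how that cascade tends to look in practice, one common arrangement keeps shared templates in a single project under the group and pulls them into each pipeline with include. The project path, ref, and file name below are hypothetical.

```yaml
# .gitlab-ci.yml in any project under the group
include:
  - project: "platform/ci-templates"   # hypothetical shared template project
    ref: main                          # pin to a tag instead for stricter control
    file: "/templates/node.yml"        # hypothetical template file

# Jobs defined in the included file are merged into this pipeline.
```

Pinning ref to a tag rather than a branch is a design choice: teams get template updates only when they opt in.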

The most common mistake: creating one flat group with 200 projects. At 50 projects, it’s manageable. At 200, it’s chaos. Nobody knows which labels are standardized. Nobody knows which templates are current. Permissions become a nightmare.

The fix: mirror your organizational structure. If engineering is divided into Platform, Backend, and Frontend teams, those become groups. Products underneath become subgroups. Individual repositories become projects. A team of 200 engineers might have 10 groups, 5 subgroups each, 4 projects each — the same 200 repositories, but with natural organizational boundaries.

Runner Fleet Design: The Part Nobody Tells You

Runners execute your CI jobs. Design matters.

One runner per cluster. GitLab’s documentation is explicit: don’t deploy multiple runners on the same Kubernetes cluster. One runner installation per cluster, with tagged jobs to route work appropriately.

Three levels of runners:

Instance-level runners handle shared infrastructure — linting, formatting, simple checks. Group-level runners handle team-specific build and test jobs. Project-level runners handle truly unique requirements like GPU compute or Mac builds.

Tags route jobs to runners. Tag by capability, not by team. Use high-memory, not team-backend. This way, any team that needs high-memory runners can use them without creating a dedicated runner for every team.
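A minimal sketch of capability-based routing, assuming a runner has been registered with the tag high-memory. The job name and build command are placeholders.

```yaml
# .gitlab-ci.yml
build-large:
  stage: build
  tags:
    - high-memory        # matches any runner carrying this tag, whichever team owns it
  script:
    - make build         # placeholder build command
```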

Ephemeral runners are not optional. Jobs leave state behind. Environment variables, temp files, Docker layers. Non-ephemeral runners accumulate that state. Eventually, something breaks in a way nobody can reproduce. Use fresh instances for every job.

Caching That Actually Works

A team running Node.js builds without caching can spend 40% of every pipeline downloading dependencies into node_modules. That’s not compute. That’s waiting.

Configure caching in two places:

# .gitlab-ci.yml
cache:
  key: ${CI_COMMIT_REF_SLUG}
  paths:
    - node_modules/
    - .cache/

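# config.toml
[runners.cache]
  Type = "s3"
  Shared = true  # runners pointing at this bucket share one cache
  [runners.cache.s3]
    ServerAddress = "s3.amazonaws.com"
    BucketName = "gitlab-runner-cache"
    BucketLocation = "us-east-1"  # assumed region; use your bucket's region
    AuthenticationType = "iam"    # assumes the runner has an IAM role; otherwise set AccessKey and SecretKey

The YAML tells GitLab what to cache. The TOML tells the runner where to store it. With ephemeral runners you need both. Without a distributed cache backend, the runner keeps the cache on local disk, and that disk disappears with the instance, so every job starts with a cache miss and nothing tells you why.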
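One common refinement, sketched here under the assumption of an npm project with a package-lock.json: key the cache on the lockfile instead of the branch, and cache npm's download directory rather than node_modules, since npm ci removes node_modules before installing. Job names and the .npm/ path are illustrative.

```yaml
# .gitlab-ci.yml
install:
  stage: build
  cache:
    key:
      files:
        - package-lock.json     # the key changes only when dependencies change
    paths:
      - .npm/
    policy: pull-push           # this job is the one that writes the cache
  script:
    - npm ci --cache .npm --prefer-offline

test:
  stage: test
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm/
    policy: pull                # read-only: skips the upload step
  script:
    - npm ci --cache .npm --prefer-offline
    - npm test
```

Marking most jobs as pull keeps them from re-uploading an unchanged cache at the end of every run.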

Pipeline Patterns That Prevent Regret

Use needs for DAG ordering. Sequential stages mean slow pipelines. The needs keyword lets jobs declare their dependencies explicitly:

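deploy-job:
  stage: deploy
  script:
    - ./deploy.sh  # placeholder deploy command
  needs:
    - job: test-unit
      optional: true
    - job: sast
      optional: true

Deploy starts as soon as test-unit and sast have both finished, without waiting for anything else in earlier stages. optional: true does not let deploy skip ahead of a running job. It only tells GitLab to ignore the entry when that job is absent from the pipeline, for example because its rules excluded it. If deploy should not wait for SAST, drop sast from needs.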
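To make the ordering concrete, here is a small sketch of a pipeline where deploy depends on only one of two test jobs. The test-e2e job and all script commands are hypothetical.

```yaml
# .gitlab-ci.yml
stages: [build, test, deploy]

build:
  stage: build
  script:
    - npm run build

test-unit:
  stage: test
  needs: [build]
  script:
    - npm test

test-e2e:                  # hypothetical slow job
  stage: test
  needs: [build]
  script:
    - npm run test:e2e

deploy-job:
  stage: deploy
  needs: [test-unit]       # starts when test-unit passes; does not wait for test-e2e
  script:
    - ./deploy.sh          # placeholder deploy command
```

The trade-off is explicit: anything left out of needs no longer gates the job, so only omit checks you are willing to deploy without.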

Use rules: changes to skip irrelevant jobs. Running the full test suite on a README change is wasteful. GitLab CI knows which files changed:

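lint-job:
  script:
    - npm run lint  # placeholder; use your project's lint command
  rules:
    - if: '$CI_COMMIT_BRANCH == "develop"'
      changes:
        - "**/*.js"
        - "**/*.ts"

No JavaScript or TypeScript changes on develop? Skip the linter. Note that the if clause also limits this job to develop: on any other branch the rule never matches and the job is skipped regardless of what changed. For a linter that takes ten minutes, that is ten minutes saved on every pipeline that doesn’t need it.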
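A variant worth considering, sketched under the assumption that the project uses merge request pipelines: there, changes is evaluated against the merge request's target branch, which usually matches what reviewers care about. The lint command is a placeholder.

```yaml
# .gitlab-ci.yml
lint-job:
  script:
    - npm run lint         # placeholder lint command
  rules:
    # Run only in merge request pipelines, and only when JS/TS files differ
    # from the target branch.
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
      changes:
        - "**/*.js"
        - "**/*.ts"
```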

Use YAML anchors for shared configuration. The same before_script across 20 jobs is 20 opportunities for drift:

.base-job: &base-job
  image: node:18
  before_script:
    - npm ci

build:
  <<: *base-job
  script:
    - npm run build

One definition. Everywhere it’s needed. Change it once, it changes everywhere.
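Anchors only resolve within a single YAML file. If the shared definition lives in an included template, the extends keyword is the usual alternative. A minimal sketch of the same job written that way:

```yaml
# .gitlab-ci.yml
.base-job:                 # hidden job: the leading dot keeps it from running on its own
  image: node:18
  before_script:
    - npm ci

build:
  extends: .base-job       # merges image and before_script into this job
  script:
    - npm run build
```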

The Kubernetes Pattern at Scale

At 50,000-75,000 jobs per day, the pattern converges:

EKS (AWS) or GKE (GCP) for runner orchestration. The official Helm chart deploys runners. Each team gets a dedicated namespace so resource contention is isolated.
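A rough sketch of what a per-team values file for the official gitlab-runner Helm chart might look like. The file name, URL, namespace, and concurrency figure are all assumptions for illustration; check the key names against the chart version you deploy.

```yaml
# values-backend.yaml (hypothetical file name)
gitlabUrl: https://gitlab.example.com/   # placeholder GitLab URL
runnerToken: ""                          # supply from a secret rather than committing it
concurrent: 20                           # illustrative cap on simultaneous jobs
rbac:
  create: true
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        # hypothetical per-team namespace for job pods
        namespace = "team-backend"
        image = "alpine:3.20"
```

Installing it into the team's namespace would then look something like helm install --namespace team-backend gitlab-runner -f values-backend.yaml gitlab/gitlab-runner.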

Fleeting plugin for autoscaling. When a job is queued, Fleeting spins up a spot instance. The job runs. The instance terminates. You pay only for what you use.

T-shirt sizing with resource limits. Small jobs get small instances. Large builds get large instances. Tags route jobs to the right tier.
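One way to express the tiers, assuming the Kubernetes executor and runners tagged small and large (hypothetical tag names). The resource variables below only take effect if the runner's configuration allows overwrites through the matching overwrite_max_allowed settings; the sizes and commands are illustrative.

```yaml
# .gitlab-ci.yml
unit-tests:
  tags: [small]
  variables:
    KUBERNETES_CPU_REQUEST: "1"
    KUBERNETES_MEMORY_REQUEST: "2Gi"
  script:
    - npm test

full-build:
  tags: [large]
  variables:
    KUBERNETES_CPU_REQUEST: "4"
    KUBERNETES_MEMORY_REQUEST: "16Gi"
  script:
    - npm run build
```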

A team running 20,000 jobs per week on bare metal described the alternative: Docker executor, t-shirt sized machines, specific tags for resource selection. Works at 20K/week. At 75K/day, you need Kubernetes.

Security Scanning as a Pipeline Stage

If you’re on Ultimate, security scanning should be mandatory. Not optional. Not “run when someone remembers.” A required pipeline stage that gates deployment.

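# .gitlab-ci.yml
# GitLab's bundled templates add the scanner jobs and upload their reports
# as sast and dependency_scanning artifacts. Template names can vary by
# GitLab version, so check yours.
include:
  - template: Jobs/SAST.gitlab-ci.yml
  - template: Jobs/Dependency-Scanning.gitlab-ci.yml

The scanners run. The results feed the Security Dashboard. On their own, the scanner jobs report findings without failing the pipeline, so blocking takes one more piece: a merge request approval policy that requires approval when a new critical vulnerability appears. With that policy in place, the merge request stays blocked until someone with authority signs off, and developers can’t quietly bypass it.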

This is what “shift-left” actually means in implementation, not marketing slides.

Pipeline examples are illustrative patterns based on GitLab official documentation and practitioner-reported configurations. Adapt them to your specific language, build system, and security requirements.
