
Deploying an Apache Airflow Cluster on Kubernetes with Helm and GitSync

This guide explains how to set up a production‑grade Apache Airflow cluster on Kubernetes using different executors, Helm charts, Git‑based DAG synchronization, custom Docker images, and related tooling such as Chocolatey, Helm, and Git, providing step‑by‑step commands and configuration details.

Big Data Technology Architecture

Data‑processing logic often becomes complex, with scripts that depend on each other and require robust operations monitoring; to address these challenges the author studied Apache Airflow using the official documentation and the book "Data Pipelines with Apache Airflow" and recorded the findings.

After covering Airflow installation, DAGs, and operators in a previous article, this piece focuses on deploying an Airflow cluster.

Cluster Deployment

Airflow supports several executor modes. SequentialExecutor, the default, runs one task at a time in a single process. LocalExecutor runs tasks in multiple processes on one machine (up to 32 in parallel by default, governed by the parallelism setting). Cluster‑oriented executors such as CeleryExecutor and KubernetesExecutor distribute tasks across the nodes of a compute cluster.
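The difference between one-at-a-time and bounded-parallel execution can be illustrated with a small sketch. This is not Airflow's implementation: threads stand in for LocalExecutor's worker processes, and run_task, run_sequential, and run_local_style are hypothetical names chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id: int) -> int:
    """Stand-in for a task: wait briefly, then return a result."""
    time.sleep(0.01)
    return task_id * 2


def run_sequential(task_ids):
    """SequentialExecutor style: one task at a time, in order."""
    return [run_task(task_id) for task_id in task_ids]


def run_local_style(task_ids, parallelism=4):
    """LocalExecutor style: at most `parallelism` tasks in flight at once.

    Threads are used here only to keep the sketch short; the real
    LocalExecutor uses separate processes.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() returns results in submission order.
        return list(pool.map(run_task, task_ids))
```

Both functions return the same results; only the degree of concurrency differs.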

Celery

Install Airflow with Celery dependencies using:

pip install "apache-airflow[celery]"

Celery implements a distributed task queue consisting of:

Producer (Celery client) that creates tasks and pushes them to a broker.

Broker (e.g., RabbitMQ or Redis) that routes tasks to workers.

Consumer (Celery worker) that receives tasks and executes them.

One or more message queues, each bound to a specific worker via an exchange and routing key.
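The producer, broker, and consumer roles listed above can be sketched in a single process. This is only an analogy, assuming a queue.Queue as a stand-in for a real broker such as RabbitMQ or Redis and threads as stand-ins for Celery workers; worker and run_demo are hypothetical names, and no Celery API is used.

```python
import queue
import threading


def worker(broker: queue.Queue, results: dict, lock: threading.Lock) -> None:
    """Consumer: take tasks from the broker until a None sentinel arrives."""
    while True:
        task = broker.get()
        if task is None:
            break
        task_id, func, args = task
        value = func(*args)
        with lock:
            results[task_id] = value


def run_demo(num_workers: int = 2) -> dict:
    broker = queue.Queue()  # stand-in for the message broker
    results = {}
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(broker, results, lock))
        for _ in range(num_workers)
    ]
    for thread in workers:
        thread.start()
    # Producer: create tasks and push them to the broker.
    for task_id in range(5):
        broker.put((task_id, pow, (task_id, 2)))
    # One sentinel per worker tells it to shut down.
    for _ in workers:
        broker.put(None)
    for thread in workers:
        thread.join()
    return results
```

Calling run_demo() returns the square of each task id, keyed by task id, regardless of which worker handled which task.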

Flower can be used to monitor Celery workers (see the Airflow documentation). All cluster nodes must have Airflow installed and share the same DAGS_FOLDER, typically kept in sync with configuration‑management tools such as Chef, Puppet, or Ansible.

Kubernetes

Kubernetes (originated at Google as a successor to its internal Borg system and open‑sourced in 2014) is an automated container‑orchestration platform that provides deployment, scaling, and management of containerised applications. Its architecture consists of Master nodes (control plane) and Worker nodes, with key components:

API Server – entry point for all HTTP requests.

etcd – distributed key‑value store for cluster state.

Kubelet – runs on each worker, communicates with the API server and manages containers.

Kube‑proxy – maintains network rules on each node so that traffic addressed to a Service reaches the right pods.

Scheduler – assigns pods to nodes based on resource requirements.

Controller‑manager – ensures the actual state matches the desired state stored in etcd.

Container runtime (e.g., containerd or Docker) – runs the containers as instructed by Kubelet.

kubectl – CLI for interacting with the cluster.
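The controller‑manager's job of driving the actual state toward the desired state can be pictured as a reconcile step. The sketch below is a simplified model, not Kubernetes code: reconcile and the pod names are hypothetical, and a real controller talks to the API server rather than returning strings.

```python
from typing import List


def reconcile(desired_replicas: int, running_pods: List[str]) -> List[str]:
    """Return the actions needed to move the actual state to the desired one."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: create the missing ones.
        start = len(running_pods)
        return [f"create pod-{start + i}" for i in range(diff)]
    if diff < 0:
        # Too many pods: delete the surplus, newest first in this toy model.
        return [f"delete {name}" for name in running_pods[diff:]]
    return []  # actual state already matches desired state
```

A real controller runs this comparison repeatedly, which is why a deleted pod belonging to a Deployment reappears shortly afterwards.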

Important Kubernetes concepts:

Pod – the smallest deployable unit, encapsulating one or more containers.

Volume – a storage abstraction that can be mounted into pods.

Deployment – manages a set of identical pods, handling replica count and versioning.

Service – provides a stable network endpoint for a group of pods.

Namespace – logical isolation within a cluster for resources and access control.

The official Airflow documentation recommends installing Airflow on Kubernetes via a Helm chart. Helm packages Kubernetes manifests (YAML) and simplifies configuration management.

Typical Helm workflow:

Fetch the chart from a repository.

Provide a custom values.yaml file to override defaults.

Merge the custom values with the chart defaults.

Render the final YAML manifests.

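Apply the manifests to the cluster. Helm submits them to the Kubernetes API itself during helm install or helm upgrade; to apply them manually, render with helm template and pipe the output to kubectl apply -f -.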
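The merge step in this workflow can be sketched as a recursive dictionary merge: nested maps are merged key by key, while scalars and lists from the custom file replace the defaults. This is a minimal approximation of Helm's behaviour, not its actual code; merge_values is a hypothetical helper, and the chart_defaults shown are illustrative values, not the real defaults of the Airflow chart.

```python
import copy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Merge overrides into a copy of defaults, recursing into nested maps."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists replace the default outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative defaults only; inspect the real ones with `helm show values`.
chart_defaults = {
    "executor": "CeleryExecutor",
    "dags": {"gitSync": {"enabled": False, "branch": "example-branch", "depth": 1}},
}
user_values = {"dags": {"gitSync": {"enabled": True, "branch": "main"}}}

merged = merge_values(chart_defaults, user_values)
```

Here merged keeps executor and depth from the defaults and takes enabled and branch from the custom file, which is why a values.yaml only needs to list the settings that differ.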

Installing Helm on Windows

Run the following PowerShell command as Administrator to install Chocolatey (a package manager used to install Helm):

Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

Then install Helm:

choco install kubernetes-helm

Add the Airflow chart repository and install/upgrade the release:

helm repo add apache-airflow https://airflow.apache.org
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace

After a successful installation the output includes the default credentials (username: admin, password: admin) and a port‑forward command to access the Airflow web UI:

kubectl port-forward svc/airflow-webserver 8080:8080 --namespace airflow
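With the port‑forward running, the deployment can be checked from a script. The sketch below assumes the Airflow 2 webserver, which serves a JSON status document at /health, and assumes the forwarded address http://localhost:8080; fetch_health and unhealthy_components are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url: str = "http://localhost:8080", timeout: float = 5.0) -> dict:
    """Download the webserver's health document (requires the port-forward)."""
    with urlopen(f"{base_url}/health", timeout=timeout) as response:
        return json.load(response)


def unhealthy_components(health: dict) -> list:
    """Return the names of components whose status is not 'healthy'."""
    return sorted(
        name for name, info in health.items() if info.get("status") != "healthy"
    )
```

A typical use is unhealthy_components(fetch_health()), where an empty list means every reported component is healthy. Components that are not deployed at all may report an empty status and would then appear in the list, so the result should be read against what the release is expected to run.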

DAG Synchronisation

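The chart documentation describes three ways to keep DAG files consistent across the cluster: baking the DAGs into the Docker image, pulling them from a Git repository with git‑sync, or mounting a persistent volume that is populated externally. This guide adopts the git‑sync approach. In values.yaml set: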

dags:
  gitSync:
    enabled: true
    repo: ssh://git@github.com/yzengnash/learngit.git
    branch: main
    rev: HEAD
    depth: 1
    maxFailures: 0
    subPath: "tests/dags"
    sshKeySecret: airflow-ssh-secret

Create a Kubernetes secret that contains the private SSH key used for the Git repository:

kubectl create secret generic airflow-ssh-secret --from-file=gitSshKey=$HOME/.ssh/id_rsa -n airflow
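For readers who prefer a declarative manifest, the object that kubectl creates here can be approximated as follows. A Secret stores each value base64‑encoded under its key, so the sketch encodes the key material under gitSshKey. build_ssh_secret is a hypothetical helper, the key bytes are a placeholder, and the default name is the one referenced by sshKeySecret in values.yaml.

```python
import base64
import json


def build_ssh_secret(
    private_key: bytes,
    name: str = "airflow-ssh-secret",
    namespace: str = "airflow",
) -> dict:
    """Build a Secret manifest holding the SSH key under the gitSshKey key."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Secret data values must be base64-encoded strings.
        "data": {"gitSshKey": base64.b64encode(private_key).decode("ascii")},
    }


# Placeholder bytes; in practice read the real private key file instead.
manifest = build_ssh_secret(b"PLACEHOLDER-PRIVATE-KEY\n")
manifest_json = json.dumps(manifest, indent=2)
```

Since kubectl accepts JSON as well as YAML, manifest_json could be saved to a file and applied with kubectl apply -f. Base64 is an encoding, not encryption, so the file should be treated as sensitive.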

Verify the secret:

kubectl get secrets -n airflow

Upgrade the Helm release with the new values:

helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug

Handling Image Pull Errors

When the cluster cannot pull the default git‑sync image due to network restrictions, a private mirror (e.g., Alibaba Cloud) can be used. Build a custom image, tag it to match the expected name, push it to the private registry, and reference it in the Helm values.

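# Assumes the custom image already exists locally (built or pulled beforehand) as
# registry.cn-hangzhou.aliyuncs.com/nash_image/image_repo:v1
# Tag it with the original image name the chart expects
docker tag registry.cn-hangzhou.aliyuncs.com/nash_image/image_repo:v1 k8s.gcr.io/git-sync/git-sync:v3.4.0
# Log in to the private registry, then push the custom image
docker login registry.cn-hangzhou.aliyuncs.com
docker push registry.cn-hangzhou.aliyuncs.com/nash_image/image_repo:v1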

Re‑run the Helm upgrade command to apply the corrected image.
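Referencing the mirrored image in the Helm values amounts to splitting the image reference into a repository and a tag. The sketch below does that and builds an override map. It assumes the chart reads the git‑sync image from images.gitSync.repository and images.gitSync.tag, which should be confirmed with helm show values apache-airflow/airflow for the chart version in use. split_image_ref and gitsync_image_override are hypothetical helpers, and digest references (name@sha256:...) are not handled.

```python
def split_image_ref(ref: str) -> tuple:
    """Split 'registry/path/name:tag' into (repository, tag)."""
    repository, sep, tag = ref.rpartition(":")
    # No colon, or the colon belongs to a registry port (e.g. localhost:5000/img).
    if not sep or "/" in tag:
        return ref, "latest"
    return repository, tag


def gitsync_image_override(ref: str) -> dict:
    """Build a values override pointing git-sync at a mirrored image.

    The key names are an assumption about the chart's values layout.
    """
    repository, tag = split_image_ref(ref)
    return {"images": {"gitSync": {"repository": repository, "tag": tag}}}


override = gitsync_image_override(
    "registry.cn-hangzhou.aliyuncs.com/nash_image/image_repo:v1"
)
```

The resulting map corresponds to a few lines of values.yaml. Setting the image explicitly avoids relying on a locally retagged copy being present on every node.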

Git Installation and Configuration

Install Git for Windows from its official download page, then configure global user information:

git config --global user.name "Your Name"
git config --global user.email "you@example.com"

Initialize a repository, add files, commit, and push to GitHub:

git init
git add .
git commit -m "Initial commit"
git remote add origin https://github.com/yzengnash/learngit.git
git branch -M main
git push -u origin main

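Generate an SSH key pair, add the public key to GitHub, and configure the remote URL to use SSH for authentication. If HTTPS operations fail on certificate verification, it can be disabled as a last resort. Because --global affects every repository and removes protection against man‑in‑the‑middle attacks, restore the default with git config --global --unset http.sslVerify once the underlying problem is fixed: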

git config --global http.sslVerify "false"
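Switching the remote from HTTPS to SSH, as mentioned above, is a mechanical rewrite of the URL. The sketch below derives the ssh:// form from the HTTPS remote used earlier. https_to_ssh is a hypothetical helper, and the user name git is an assumption that holds for GitHub and most hosted Git services.

```python
from urllib.parse import urlparse


def https_to_ssh(url: str, user: str = "git") -> str:
    """Convert an HTTPS Git remote into the equivalent ssh:// URL."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError(f"not an HTTPS remote: {url!r}")
    return f"ssh://{user}@{parsed.hostname}{parsed.path}"


ssh_url = https_to_ssh("https://github.com/yzengnash/learngit.git")
```

The result can be passed to git remote set-url origin, and the same ssh:// form is what the repo setting for git‑sync expects when an SSH key secret is used.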

The article concludes with a promise to cover other scheduling tools such as DolphinScheduler and DataWorks in future posts.
