Deploying an Apache Airflow Cluster on Kubernetes with Helm and GitSync
This guide walks through setting up a production‑grade Apache Airflow cluster on Kubernetes: choosing an executor, installing the official Helm chart, synchronizing DAGs from Git, working around image‑pull problems with a custom Docker image, and installing supporting tools (Chocolatey, Helm, Git) along the way, with step‑by‑step commands and configuration details.
Data‑processing logic often grows complex, with interdependent scripts that need robust operational monitoring. To address these challenges the author studied Apache Airflow using the official documentation and the book "Data Pipelines with Apache Airflow", and recorded the findings here.
After covering Airflow installation, DAGs, and operators in a previous article, this piece focuses on deploying an Airflow cluster.
Cluster Deployment
Airflow supports several executor modes: the default SequentialExecutor, which runs tasks one at a time in a single process; LocalExecutor, which runs tasks in multiple processes on one machine (the default parallelism limit is 32); and cluster‑oriented executors such as CeleryExecutor and KubernetesExecutor that spread work across a compute cluster.
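The executor is selected in airflow.cfg (or via the equivalent AIRFLOW__CORE__EXECUTOR environment variable); for example, to run with LocalExecutor:

```ini
[core]
executor = LocalExecutor
; upper bound on concurrently running task instances (default is 32)
parallelism = 32
```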
Celery
Install Airflow with Celery dependencies using:
pip install apache-airflow[celery]

Celery implements a distributed task queue consisting of:
Producer (Celery client) that creates tasks and pushes them to a broker.
Broker (e.g., RabbitMQ or Redis) that routes tasks to workers.
Consumer (Celery worker) that receives tasks and executes them.
One or more message queues, each bound to a specific worker via an exchange and routing key.
Flower can be used to monitor Celery workers (see the Airflow docs). All cluster nodes must have Airflow installed and share the same DAGS_FOLDER, typically synchronized with configuration‑management tools such as Chef, Puppet, or Ansible.
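The producer/broker/consumer flow above can be sketched with nothing but the Python standard library. This is not Celery itself, just a minimal illustration of the pattern: the broker is an in‑memory queue standing in for RabbitMQ or Redis, and the task names are made up for the example.

```python
import queue
import threading

# The "broker": in a real Celery deployment this is RabbitMQ or Redis.
broker = queue.Queue()
results = {}

def producer():
    """Celery client: creates tasks and pushes them to the broker."""
    for n in range(5):
        broker.put(("square", n))
    broker.put(None)  # sentinel: no more tasks

def consumer():
    """Celery worker: pulls tasks from the broker and executes them."""
    while True:
        task = broker.get()
        if task is None:
            break
        name, arg = task
        if name == "square":
            results[arg] = arg * arg

worker = threading.Thread(target=consumer)
worker.start()
producer()
worker.join()
print(results)  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
```

Scaling out in Celery means starting more worker processes on more machines against the same broker, which is exactly what CeleryExecutor manages for Airflow tasks.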
Kubernetes
Kubernetes (developed at Google and open‑sourced in 2014) is an automated container‑orchestration platform that provides deployment, scaling, and management of containerised applications. Its architecture consists of Master nodes (the control plane) and Worker nodes, with these key components:
API Server – entry point for all HTTP requests.
etcd – distributed key‑value store for cluster state.
Kubelet – runs on each worker, communicates with the API server and manages containers.
Kube‑proxy – maintains network rules on each node and routes service traffic to pods.
Scheduler – assigns pods to nodes based on resource requirements.
Controller‑manager – ensures the actual state matches the desired state stored in etcd.
Container engine (Docker) – runs the containers as instructed by Kubelet.
kubectl – CLI for interacting with the cluster.
Important Kubernetes concepts:
Pod – the smallest deployable unit, encapsulating one or more containers.
Volume – a storage abstraction that can be mounted into pods.
Deployment – manages a set of identical pods, handling replica count and versioning.
Service – provides a stable network endpoint for a group of pods.
Namespace – logical isolation within a cluster for resources and access control.
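A minimal Pod manifest ties several of these concepts together (the pod and volume names below are hypothetical): a named pod in a namespace, running one container with a volume mounted into it.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: airflow-demo        # hypothetical pod name
  namespace: airflow
spec:
  containers:
    - name: worker
      image: apache/airflow:2.6.0
      volumeMounts:
        - name: dags
          mountPath: /opt/airflow/dags
  volumes:
    - name: dags
      emptyDir: {}          # scratch volume that lives as long as the pod
```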
The official Airflow documentation recommends installing Airflow on Kubernetes via a Helm chart. Helm packages Kubernetes manifests (YAML) and simplifies configuration management.
Typical Helm workflow:
Fetch the chart from a repository.
Provide a custom values.yaml file to override defaults.
Merge the custom values with the chart defaults.
Render the final YAML manifests.
Apply the manifests to the cluster with kubectl apply.
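The merge step can be illustrated with a small Python sketch. This is a simplification of what Helm does internally, not Helm's actual code: user‑supplied values recursively override the chart's defaults, key by key.

```python
def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay a user values.yaml onto chart defaults,
    the way `helm upgrade -f values.yaml` combines them."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged

chart_defaults = {"dags": {"gitSync": {"enabled": False, "depth": 1}}}
user_values = {"dags": {"gitSync": {"enabled": True}}}

print(merge_values(chart_defaults, user_values))
# {'dags': {'gitSync': {'enabled': True, 'depth': 1}}}
```

Note that keys absent from the user file (here, depth) keep their chart defaults, which is why a values.yaml only needs to list the settings being changed.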
Installing Helm on Windows
Run the following PowerShell command as Administrator to install Chocolatey (a package manager used to install Helm):
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

Then install Helm:
choco install kubernetes-helm

Add the Airflow chart repository and install/upgrade the release:
helm repo add apache-airflow https://airflow.apache.org
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace

After a successful installation the output includes the default credentials (username: admin, password: admin) and a port‑forward command to access the Airflow web UI:
kubectl port-forward svc/airflow-webserver 8080:8080 --namespace airflow

DAG Synchronisation
Three approaches exist for keeping DAG files consistent across the cluster (baking them into the Airflow image, mounting a shared persistent volume, or pulling them from Git with git‑sync); this guide adopts the GitSync method. In values.yaml set:

dags:
  gitSync:
    enabled: true
    repo: ssh://[email protected]/yzengnash/learngit.git
    branch: main
    rev: HEAD
    depth: 1
    maxFailures: 0
    subPath: "tests/dags"
    sshKeySecret: airflow-ssh-secret

Create a Kubernetes secret that contains the private SSH key used for the Git repository (its name must match the sshKeySecret value):
kubectl create secret generic airflow-ssh-secret --from-file=gitSshKey=$HOME/.ssh/id_rsa -n airflow

Verify the secret:
kubectl get secrets -n airflow

Upgrade the Helm release with the new values:
helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug

Handling Image Pull Errors
When the cluster cannot pull the default git‑sync image from k8s.gcr.io due to network restrictions, a private mirror (e.g., Alibaba Cloud Container Registry) can be used: build or pull the image, push it to the accessible registry, and either re‑tag it on each node to the name the chart expects or reference the mirror image in the Helm values.
# Build the custom image locally (or pull it from an accessible mirror)
# Tag it to the image name the chart expects
docker tag registry.cn-hangzhou.aliyuncs.com/nash_image/image_repo:v1 k8s.gcr.io/git-sync/git-sync:v3.4.0
# Log in and push to the private registry
docker login registry.cn-hangzhou.aliyuncs.com
docker push registry.cn-hangzhou.aliyuncs.com/nash_image/image_repo:v1

Re‑run the Helm upgrade command to apply the corrected image.
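Alternatively, the chart can be pointed straight at the mirrored image in values.yaml. The override below assumes the Alibaba Cloud repository path used in the docker commands above:

```yaml
images:
  gitSync:
    repository: registry.cn-hangzhou.aliyuncs.com/nash_image/image_repo
    tag: v1
```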
Git Installation and Configuration
Install Git for Windows (download page), then configure global user information:
git config --global user.name "Your Name"
git config --global user.email "[email protected]"

Initialize a repository, add files, commit, and push to GitHub:
git init
git add .
git commit -m "Initial commit"
git remote add origin https://github.com/yzengnash/learngit.git
git branch -M main
git push -u origin main

Generate an SSH key pair, add the public key to GitHub, and configure the remote URL to use SSH for authentication. If SSL verification causes errors, disable it temporarily:
git config --global http.sslVerify false

The article concludes with a promise to cover other scheduling tools such as DolphinScheduler and DataWorks in future posts.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies