Comprehensive Guide to Installing and Using Apache Airflow with Docker on Windows
This article provides a detailed tutorial on Apache Airflow fundamentals, Docker-based installation on Windows, Dockerfile creation, container deployment via docker run and Docker Compose, Airflow configuration, and practical usage of DAGs, tasks, connections, and UI features for data pipeline orchestration.
Apache Airflow, originally developed at Airbnb in 2014 and later graduated to an Apache Top-Level Project in 2019, is a Python‑based workflow orchestration platform that uses directed acyclic graphs (DAGs) to define and schedule data pipelines, offering features such as task dependencies, monitoring, and extensibility with many integrations (e.g., AWS S3, Docker, Hadoop, Hive, Kubernetes, MySQL, Postgres, Zeppelin).
The article explains key Airflow concepts: Data Pipeline, DAGs, Tasks (operators like BashOperator and PythonOperator), Connections, Pools, XComs, Trigger Rules, Backfill, the Airflow 2.0 API, and the AIRFLOW_HOME directory for DAG and plugin storage.
For Windows users, the guide recommends installing Docker Desktop (typically with the WSL 2 backend) and outlines the steps to create a custom Dockerfile that extends the official apache/airflow:2.3.0 image, installs additional Linux tools, copies a requirements.txt file, installs Python dependencies, adds DAG scripts, and creates a writable directory.
# Use the official Airflow image
FROM apache/airflow:2.3.0
# Switch to root to install system packages
USER root
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
vim \
&& apt-get autoremove -yqq --purge \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Switch back to airflow user for pip installs
USER airflow
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
# Copy DAG and data files with proper ownership
COPY --chown=airflow:root BY02_AirflowTutorial.py /opt/airflow/dags
COPY src/data.sqlite /opt/airflow/data.sqlite
# Create a writable directory
RUN umask 0002; \
mkdir -p ~/writeable_directory

Two deployment methods are described:
Docker run: build the image with docker build -t airflow:latest . and run it using docker run -it --name test -p 8080:8080 --env "_AIRFLOW_DB_UPGRADE=true" --env "_AIRFLOW_WWW_USER_CREATE=true" --env "_AIRFLOW_WWW_USER_PASSWORD=admin" airflow:latest airflow standalone.
Docker Compose: define services in a docker-compose.yml (including Airflow, PostgreSQL, Redis, and workers), place an .env file with AIRFLOW_UID=50000 alongside it, then execute docker-compose up to launch all containers.
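The service layout can be sketched as below. This is a heavily trimmed, illustrative outline of the official docker-compose.yaml; the real file also defines healthchecks, volumes, a triggerer, and an init job:

```yaml
# Trimmed sketch of the Compose layout (illustrative only)
x-airflow-common: &airflow-common
  image: apache/airflow:2.3.0
  environment:
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
  depends_on:
    - postgres
    - redis

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
  redis:
    image: redis:latest
  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
  airflow-worker:
    <<: *airflow-common
    command: celery worker
```

The YAML anchor (`&airflow-common` / `<<:`) keeps the shared image and environment in one place so every Airflow service stays consistent.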
After deployment, the article shows how to initialize the metadata database ( airflow db init ), create an admin user, and start the webserver and scheduler either via airflow standalone or the explicit commands:
airflow db init
airflow users create \
--username admin \
--firstname Peter \
--lastname Parker \
--role Admin \
--email [email protected]
airflow webserver --port 8080
airflow scheduler

Configuration details are covered, including editing airflow.cfg to set the SQLAlchemy connection string, choosing an executor, disabling example DAGs (AIRFLOW__CORE__LOAD_EXAMPLES=False), and customizing the UI.
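Every airflow.cfg option can equivalently be set as an environment variable of the form AIRFLOW__{SECTION}__{KEY}, which overrides the file. A small illustrative fragment (connection string and executor are example values, not ones from the article):

```ini
; airflow.cfg (illustrative values)
[core]
executor = LocalExecutor
load_examples = False

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow
```

Setting AIRFLOW__CORE__LOAD_EXAMPLES=False in the container environment has the same effect as load_examples = False in the [core] section.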
UI usage is demonstrated: enabling a DAG with the toggle switch on the left, starting DAG runs via the UI, CLI, or HTTP API, inspecting task logs, clearing failed tasks, and visualizing the DAG in the Graph and Tree views.
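As a sketch, a run of a DAG (the id example_pipeline and the admin:admin credentials are illustrative) can be started from the CLI or from the Airflow 2 stable REST API against a running webserver:

```shell
# CLI trigger
airflow dags trigger example_pipeline

# Stable REST API trigger (Airflow 2.0+), assuming basic auth is enabled
curl -X POST "http://localhost:8080/api/v1/dags/example_pipeline/dagRuns" \
  -H "Content-Type: application/json" \
  --user "admin:admin" \
  -d '{"conf": {}}'
```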
Overall, the guide serves as a practical reference for data engineers who need to set up, configure, and operate Apache Airflow in a containerized Windows environment.