
Airflow 2 support #643

Open · wants to merge 26 commits into base: master
19 changes: 11 additions & 8 deletions .github/workflows/ci.yml
@@ -1,18 +1,21 @@
name: CI
name: Docker Image CI

on:
push:
branches:
- master
- '*'
pull_request:
branches:
- master
- master

jobs:
ci:

build:


runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v1
- run: docker build -t "${PWD##*/}" .
- run: docker run "${PWD##*/}" python -V
- run: docker run "${PWD##*/}" version
- uses: actions/checkout@v2
- name: Build the Docker image
run: docker build . --file Dockerfile --tag dataopssre/docker-airflow2:$(date +%s)
38 changes: 38 additions & 0 deletions .github/workflows/publish.yml
@@ -0,0 +1,38 @@
# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: CI

on:
release:
types: [published]

jobs:
push_to_registry:
name: Push Docker image to Docker Hub
runs-on: ubuntu-latest
steps:
- name: Check out the repo
uses: actions/checkout@v2

- name: Log in to Docker Hub
uses: docker/login-action@f054a8b539a109f9f41c372932f1ae047eff08c9
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }}

- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@98669ae865ea3cffbcbaa878cf57c20bbf1c6c38
with:
images: dataopssre/docker-airflow2

- name: Build and push Docker image
uses: docker/build-push-action@ad44023a93711e3deb337508980b4b5e9bcdc5dc
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
36 changes: 36 additions & 0 deletions .github/workflows/publish_helm_chart.yml
@@ -0,0 +1,36 @@
name: Release Charts

on:
push:
branches:
- master

jobs:
release:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v2
with:
fetch-depth: 0

- name: Configure Git
run: |
git config user.name "$GITHUB_ACTOR"
git config user.email "$GITHUB_ACTOR@users.noreply.github.com"

- name: Install Helm
uses: azure/setup-helm@v1
with:
version: v3.4.0

- name: Add Helm repo dependency
run: helm repo add bitnami https://charts.bitnami.com/bitnami

- name: Run chart-releaser
uses: helm/[email protected]
with:
charts_dir: charts
env:
CR_TOKEN: "${{ secrets.GITHUB_TOKEN }}"
CR_RELEASE_NAME_TEMPLATE: "helm-chart/{{ .Name }}-{{ .Version }}"
1 change: 1 addition & 0 deletions .gitignore
@@ -11,6 +11,7 @@ Session.vim
*.tmlanguage.cache
*.tmPreferences.cache
*.stTheme.cache
.idea/

# workspace files are user-specific
*.sublime-workspace
89 changes: 13 additions & 76 deletions Dockerfile
@@ -1,86 +1,23 @@
# VERSION 1.10.9
# AUTHOR: Matthieu "Puckel_" Roisil
# DESCRIPTION: Basic Airflow container
# BUILD: docker build --rm -t puckel/docker-airflow .
# SOURCE: https://github.com/puckel/docker-airflow
FROM apache/airflow:2.2.3-python3.9

FROM python:3.7-slim-buster
LABEL maintainer="Puckel_"
LABEL maintainer="dataops-sre"

# Never prompt the user for choices on installation/configuration of packages
ENV DEBIAN_FRONTEND noninteractive
ENV TERM linux
ARG AIRFLOW_VERSION=2.2.3
ARG PYTHON_VERSION=3.9

# Airflow
ARG AIRFLOW_VERSION=1.10.9
ARG AIRFLOW_USER_HOME=/usr/local/airflow
ARG AIRFLOW_DEPS=""
ARG PYTHON_DEPS=""
ENV AIRFLOW_HOME=${AIRFLOW_USER_HOME}

# Define en_US.
ENV LANGUAGE en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LC_CTYPE en_US.UTF-8
ENV LC_MESSAGES en_US.UTF-8
RUN pip install apache-airflow[crypto,celery,postgres,hive,jdbc,mysql,ssh,kubernetes,snowflake${AIRFLOW_DEPS:+,}${AIRFLOW_DEPS}]==${AIRFLOW_VERSION} \
# plyvel is required by airflow; unknown why it is not installed as a dependency
&& pip install plyvel==1.3.0 \
# pyarrow 5.0 is required by snowflake but possibly incompatible with airflow
&& pip install pyarrow==5.0.0 \
&& if [ -n "${PYTHON_DEPS}" ]; then pip install ${PYTHON_DEPS}; fi

# Disable noisy "Handling signal" log messages:
# ENV GUNICORN_CMD_ARGS --log-level WARNING

RUN set -ex \
&& buildDeps=' \
freetds-dev \
libkrb5-dev \
libsasl2-dev \
libssl-dev \
libffi-dev \
libpq-dev \
git \
' \
&& apt-get update -yqq \
&& apt-get upgrade -yqq \
&& apt-get install -yqq --no-install-recommends \
$buildDeps \
freetds-bin \
build-essential \
default-libmysqlclient-dev \
apt-utils \
curl \
rsync \
netcat \
locales \
&& sed -i 's/^# en_US.UTF-8 UTF-8$/en_US.UTF-8 UTF-8/g' /etc/locale.gen \
&& locale-gen \
&& update-locale LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 \
&& useradd -ms /bin/bash -d ${AIRFLOW_USER_HOME} airflow \
&& pip install -U pip setuptools wheel \
&& pip install pytz \
&& pip install pyOpenSSL \
&& pip install ndg-httpsclient \
&& pip install pyasn1 \
&& pip install apache-airflow[crypto,celery,postgres,hive,jdbc,mysql,ssh${AIRFLOW_DEPS:+,}${AIRFLOW_DEPS}]==${AIRFLOW_VERSION} \
&& pip install 'redis==3.2' \
&& if [ -n "${PYTHON_DEPS}" ]; then pip install ${PYTHON_DEPS}; fi \
&& apt-get purge --auto-remove -yqq $buildDeps \
&& apt-get autoremove -yqq --purge \
&& apt-get clean \
&& rm -rf \
/var/lib/apt/lists/* \
/tmp/* \
/var/tmp/* \
/usr/share/man \
/usr/share/doc \
/usr/share/doc-base

COPY script/entrypoint.sh /entrypoint.sh
COPY config/airflow.cfg ${AIRFLOW_USER_HOME}/airflow.cfg

RUN chown -R airflow: ${AIRFLOW_USER_HOME}

EXPOSE 8080 5555 8793
COPY config/webserver_config.py $AIRFLOW_HOME/
COPY dags $AIRFLOW_HOME/dags

USER airflow
WORKDIR ${AIRFLOW_USER_HOME}
ENTRYPOINT ["/entrypoint.sh"]
CMD ["webserver"]
ENTRYPOINT ["/entrypoint.sh"]
77 changes: 47 additions & 30 deletions README.md
@@ -1,46 +1,65 @@
# docker-airflow
[![CI status](https://github.com/puckel/docker-airflow/workflows/CI/badge.svg?branch=master)](https://github.com/puckel/docker-airflow/actions?query=workflow%3ACI+branch%3Amaster+event%3Apush)
[![Docker Build status](https://img.shields.io/docker/build/puckel/docker-airflow?style=plastic)](https://hub.docker.com/r/puckel/docker-airflow/tags?ordering=last_updated)
# docker-airflow 2
[![Docker Image CI](https://github.com/dataops-sre/docker-airflow2/actions/workflows/ci.yml/badge.svg)](https://github.com/dataops-sre/docker-airflow2/actions/workflows/ci.yml)

[![Docker Hub](https://img.shields.io/badge/docker-ready-blue.svg)](https://hub.docker.com/r/puckel/docker-airflow/)
[![Docker Pulls](https://img.shields.io/docker/pulls/puckel/docker-airflow.svg)]()
[![Docker Stars](https://img.shields.io/docker/stars/puckel/docker-airflow.svg)]()
[![Docker Hub](https://img.shields.io/badge/docker-ready-blue.svg)](https://hub.docker.com/r/dataopssre/docker-airflow2)
[![Docker Pulls](https://img.shields.io/docker/pulls/dataopssre/docker-airflow2.svg)](https://hub.docker.com/r/dataopssre/docker-airflow2)
[![Docker Stars](https://img.shields.io/docker/stars/dataopssre/docker-airflow2.svg)](https://hub.docker.com/r/dataopssre/docker-airflow2)

This repository contains **Dockerfile** of [apache-airflow](https://github.com/apache/incubator-airflow) for [Docker](https://www.docker.com/)'s [automated build](https://registry.hub.docker.com/u/puckel/docker-airflow/) published to the public [Docker Hub Registry](https://registry.hub.docker.com/).
This repository contains a **Dockerfile** of [apache-airflow 2](https://github.com/apache/airflow) for [Docker](https://www.docker.com/)'s [automated build](https://hub.docker.com/r/dataopssre/docker-airflow2/) published to the public [Docker Hub Registry](https://hub.docker.com/r/dataopssre/docker-airflow2).

## TL;DR
Use the Docker image [dataopssre/docker-airflow2](https://hub.docker.com/r/dataopssre/docker-airflow2) to update your Airflow setup.

Helm chart to deploy the Airflow 2 Docker image:
```
helm repo add dataops-sre-airflow https://dataops-sre.github.io/docker-airflow2/
helm repo update
helm install airflow dataops-sre-airflow/airflow --wait --timeout 300s
```

## Information

* Based on Python (3.7-slim-buster) official Image [python:3.7-slim-buster](https://hub.docker.com/_/python/) and uses the official [Postgres](https://hub.docker.com/_/postgres/) as backend and [Redis](https://hub.docker.com/_/redis/) as queue
* Based on the official Airflow 2 image [apache/airflow:2.2.3-python3.9](https://hub.docker.com/r/apache/airflow) and uses the official [Postgres](https://hub.docker.com/_/postgres/) as backend and [Redis](https://hub.docker.com/_/redis/) as queue
* Docker entrypoint script is forked from [puckel/docker-airflow](https://github.com/puckel/docker-airflow)
* Install [Docker](https://www.docker.com/)
* Install [Docker Compose](https://docs.docker.com/compose/install/)
* Following the Airflow release from [Python Package Index](https://pypi.python.org/pypi/apache-airflow)

## Installation

Pull the image from the Docker repository.
## Motivation
This repo is forked from [puckel/docker-airflow](https://github.com/puckel/docker-airflow); the original repo no longer seems to be maintained.

Airflow has been updated to version 2 and now ships an [official Docker image](https://hub.docker.com/r/apache/airflow); there is also a [Bitnami Airflow image](https://hub.docker.com/r/bitnami/airflow). Nevertheless, puckel's image is still interesting: none of these providers offers an Airflow image that runs the LocalExecutor and the scheduler in a single container, which is extremely useful for deploying a simple Airflow to an AWS EKS cluster. With Kubernetes you can resolve Airflow's scalability issues by using only the KubernetesPodOperator in your DAGs; Airflow then needs virtually no computational power of its own and serves purely as a scheduler. Separating the scheduler and webserver into two different pods is a bit problematic on an AWS EKS cluster: we want to keep DAGs and logs on a persistent volume, but AWS limits multi-attach for EBS volumes, which means the webserver and scheduler pods have to be scheduled on the same EKS node. That is annoying, so puckel's Airflow startup script remains useful.

What this fork does:

* Disables the Airflow 2 login screen by default
* Improves the entrypoint script to take only Airflow environment variables into account
* Makes sure the docker-compose files work
* Adds an Airflow 2 deployment Helm chart, released via a public GitHub repository

You can use the Helm chart released in this repository (see [here](https://dataops-sre.github.io/docker-airflow2)) to deploy Airflow 2 to a Kubernetes cluster.

docker pull puckel/docker-airflow

## Build

Optionally install [extra Airflow packages](https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html) and/or Python dependencies at build time:

docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" -t puckel/docker-airflow .
docker build --rm --build-arg PYTHON_DEPS="flask_oauthlib>=0.9" -t puckel/docker-airflow .
docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" -t dataopssre/docker-airflow2 .
docker build --rm --build-arg PYTHON_DEPS="requests" -t dataopssre/docker-airflow2 .

or combined

docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" --build-arg PYTHON_DEPS="flask_oauthlib>=0.9" -t puckel/docker-airflow .
docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" --build-arg PYTHON_DEPS="requests" -t dataopssre/docker-airflow2 .

Don't forget to update the Airflow image in the docker-compose files to dataopssre/docker-airflow2:latest.

## Usage

By default, docker-airflow runs Airflow with **SequentialExecutor**:

docker run -d -p 8080:8080 dataopssre/docker-airflow2 webserver

If you want to run another executor, use the other docker-compose.yml files provided in this repository.
If you want to run another executor, use the docker-compose.yml files provided in this repository.

For **LocalExecutor** :

@@ -50,11 +69,10 @@ For **CeleryExecutor** :

docker-compose -f docker-compose-CeleryExecutor.yml up -d

NB : If you want to have DAGs example loaded (default=False), you've to set the following environment variable :
NB: If you want the example DAGs loaded (default=False), you have to set the following environment variable in the docker-compose files:

`LOAD_EX=n`
`AIRFLOW__CORE__LOAD_EXAMPLES=True`

docker run -d -p 8080:8080 -e LOAD_EX=y puckel/docker-airflow

If you want to use Ad hoc query, make sure you've configured connections:
Go to Admin -> Connections, edit "postgres_default" and set these values (equivalent to the values in airflow.cfg/docker-compose*.yml):
@@ -65,26 +83,26 @@

For encrypted connection passwords (in Local or Celery Executor), you must have the same fernet_key. By default docker-airflow generates the fernet_key at startup; you have to set an environment variable in the docker-compose file (e.g. docker-compose-LocalExecutor.yml) to use the same key across containers. To generate a fernet_key:

docker run puckel/docker-airflow python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)"
docker run dataopssre/docker-airflow2 python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)"
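
Then reuse the generated key in every container by exporting it as an Airflow environment variable (a sketch; `<generated key>` is a placeholder for the value printed above, and the variable name follows the `AIRFLOW__<section>__<key>` rule described below):

```
docker run -d -p 8080:8080 \
    -e AIRFLOW__CORE__FERNET_KEY="<generated key>" \
    dataopssre/docker-airflow2 webserver
```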

## Configuring Airflow

It's possible to set any configuration value for Airflow from environment variables, which are used over values from the airflow.cfg.
It's possible to set any configuration value for Airflow from environment variables.

The general rule is the environment variable should be named `AIRFLOW__<section>__<key>`, for example `AIRFLOW__CORE__SQL_ALCHEMY_CONN` sets the `sql_alchemy_conn` config option in the `[core]` section.
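
For example, to point Airflow at an external metadata database at run time (a sketch; the host and credentials are placeholders):

```
docker run -d -p 8080:8080 \
    -e AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:airflow@postgres:5432/airflow" \
    dataopssre/docker-airflow2 webserver
```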

Check out the [Airflow documentation](http://airflow.readthedocs.io/en/latest/howto/set-config.html#setting-configuration-options) for more details
Check out the [Airflow documentation](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html) for more details

You can also define connections via environment variables by prefixing them with `AIRFLOW_CONN_` - for example `AIRFLOW_CONN_POSTGRES_MASTER=postgres://user:password@localhost:5432/master` for a connection called "postgres_master". The value is parsed as a URI. This will work for hooks etc., but the connection won't show up in the "Ad-hoc Query" section unless an (empty) connection is also created in the DB.
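
A quick way to verify that such a connection is visible to Airflow (a sketch reusing the placeholder URI from the example above; `BaseHook.get_connection` is the standard Airflow 2 API for reading a connection):

```
docker run --rm \
    -e AIRFLOW_CONN_POSTGRES_MASTER="postgres://user:password@localhost:5432/master" \
    dataopssre/docker-airflow2 python -c "from airflow.hooks.base import BaseHook; print(BaseHook.get_connection('postgres_master').get_uri())"
```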

## Custom Airflow plugins

Airflow allows for custom user-created plugins which are typically found in `${AIRFLOW_HOME}/plugins` folder. Documentation on plugins can be found [here](https://airflow.apache.org/plugins.html)
Airflow allows for custom user-created plugins which are typically found in `${AIRFLOW_HOME}/plugins` folder. Documentation on plugins can be found [here](https://airflow.apache.org/docs/apache-airflow/stable/plugins.html)

In order to incorporate plugins into your docker container (a full example command follows this list):
- Create the plugins folder `plugins/` with your custom plugins.
- Mount the folder as a volume by doing either of the following:
- Include the folder as a volume in command-line `-v $(pwd)/plugins/:/usr/local/airflow/plugins`
- Include the folder as a volume in command-line `-v $(pwd)/plugins/:/opt/airflow/plugins`
- Use docker-compose-LocalExecutor.yml or docker-compose-CeleryExecutor.yml which contains support for adding the plugins folder as a volume
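
For example, mounting a local plugins folder at run time could look like this (a sketch combining the volume flag above with this repository's image):

```
docker run -d -p 8080:8080 \
    -v $(pwd)/plugins/:/opt/airflow/plugins \
    dataopssre/docker-airflow2 webserver
```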

## Install custom python package
@@ -111,20 +129,19 @@ This can be used to scale to a multi node setup using docker swarm.

If you want to run other airflow sub-commands, such as `dags list` or `tasks clear`, you can do so like this:

docker run --rm -ti puckel/docker-airflow airflow list_dags
docker run --rm -ti dataopssre/docker-airflow2 airflow dags list

or with your docker-compose set up like this:

docker-compose -f docker-compose-CeleryExecutor.yml run --rm webserver airflow list_dags
docker-compose -f docker-compose-CeleryExecutor.yml run --rm webserver airflow dags list

You can also use this to run a bash shell or any other command in the same environment that airflow would be run in:

docker run --rm -ti puckel/docker-airflow bash
docker run --rm -ti puckel/docker-airflow ipython
docker run --rm -ti dataopssre/docker-airflow2 bash
docker run --rm -ti dataopssre/docker-airflow2 ipython

# Simplified SQL database configuration using PostgreSQL

If the executor type is set to anything other than *SequentialExecutor*, you'll need an SQL database.
Here is a list of PostgreSQL configuration variables and their default values. They're used to compute
the `AIRFLOW__CORE__SQL_ALCHEMY_CONN` and `AIRFLOW__CELERY__RESULT_BACKEND` variables when needed for you
if you don't provide them explicitly:
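
The variable list itself is truncated here; as an illustration only (the variable names and URI formats below are assumptions following the puckel entrypoint convention, not a verbatim copy of the script), the two connection strings are assembled roughly like this:

```
# Hypothetical sketch of how the entrypoint could derive the URIs from POSTGRES_* variables
AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT}/${POSTGRES_DB}"
AIRFLOW__CELERY__RESULT_BACKEND="db+postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT}/${POSTGRES_DB}"
```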
23 changes: 23 additions & 0 deletions charts/airflow/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/