This repo demonstrates how to use the Selenium WebDriver to automate a daily web task in a Dockerized Airflow environment. The environment used for this project was Ubuntu 18.04 on AWS EC2.
Set up an EC2 instance and ensure that ports 22 (SSH) and 8080 (the Airflow webserver) are open.
ssh into the environment:
ssh -i {key pair}.pem ubuntu@{Public DNS}
Clone the repo:
git clone https://github.com/HDaniels1991/airflow_selenium.git
Run the setup script, which installs Docker Engine and Docker Compose:
bash setup.sh
Create the required Docker network to enable the containers to communicate:
docker network create container_bridge
Create the named volume used to persist downloaded files:
docker volume create downloads
Extend the Selenium image to grant the Selenium user write permissions on the folder used for downloads:
docker build -t docker_selenium -f Dockerfile-selenium .
Extend the Airflow image to grant the container access to the host Docker socket, install the requirements, and create the downloads folder. The {AIRFLOW_USER_HOME} directory is also added to the Python path so that custom Python modules can be imported:
docker build -t docker_airflow -f Dockerfile-airflow .
Start the containers with Docker Compose:
docker-compose up
The Airflow webserver will be available at the following location:
- {Public DNS}:8080
The Selenium Airflow plugin works by using Docker to spin up a remote Selenium server on the host, connecting to the WebDriver (standalone-chrome), and sending commands to it through the Python API. The plugin performs the following steps (see the sketch after this list):
- Create the Docker container.
- Connect and configure the driver.
- Execute the Python code.
- Check the execution result.
- Remove the container.
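A minimal sketch of that lifecycle, assuming the docker-py and selenium Python packages; the image, network, and volume names come from the setup steps above, while the wait, the mount path, and the example URL are illustrative rather than the plugin's actual values:

```python
import time

import docker
from selenium import webdriver

client = docker.from_env()

# 1. Create the Selenium container on the shared container_bridge network.
#    The Airflow container can then reach it by name ("selenium").
container = client.containers.run(
    "selenium/standalone-chrome",
    name="selenium",
    detach=True,
    network="container_bridge",
    volumes={"downloads": {"bind": "/home/seluser/downloads", "mode": "rw"}},
)
time.sleep(5)  # crude wait for the Selenium server to accept connections

try:
    # 2. Connect and configure the remote driver (Selenium 4 style API).
    driver = webdriver.Remote(
        command_executor="http://selenium:4444/wd/hub",
        options=webdriver.ChromeOptions(),
    )
    # 3. Execute the Python code: here, just load a page.
    driver.get("https://www.bbc.co.uk")
    # 4. Check execution: a non-empty page title is a simple success signal.
    assert driver.title, "page failed to load"
    driver.quit()
finally:
    # 5. Remove the container so each run starts from a clean state.
    container.remove(force=True)
```

Creating and removing the container on every run keeps each task isolated: no stale browser sessions or leftover state survive between daily executions.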
The example DAG is designed to download a daily podcast from the BBC called Wake Up to Money and upload it to S3.
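As a hedged illustration of how such a DAG might be wired up (Airflow 1.x style; the dag id, task ids, and placeholder callables here are hypothetical, not the repo's exact code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def download_podcast(**context):
    """Placeholder: the Selenium routine that fetches Wake Up to Money."""


def upload_to_s3(**context):
    """Placeholder: push the downloaded file from the shared volume to S3."""


with DAG(
    dag_id="wake_up_to_money",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",  # the podcast is daily, so the DAG is too
    catchup=False,
) as dag:
    download = PythonOperator(
        task_id="download_podcast",
        python_callable=download_podcast,
        provide_context=True,
    )
    upload = PythonOperator(
        task_id="upload_to_s3",
        python_callable=upload_to_s3,
        provide_context=True,
    )

    # Download the episode first, then upload it to S3.
    download >> upload
```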
Harry Daniels