- This will create a Docker container of Apache Drill for analyzing file-based data (e.g., parquet files).
- While creating the container, the image will also generate a /data folder for the data files.
- This image will also extract the JDBC driver jar of Apache Drill to connect SQL user interfaces (e.g., DataGrip) to the Drill container.
This image allows you to generate an out-of-the-box analytics environment without cluttering your machine. If this is not your cup of tea, head over to the instructions for a standard installation of Apache Drill.
Install Docker by following the instructions here. If you do not have an account with Docker, you may be asked to create one.
Ensure that Docker is correctly running using the following command:
docker version
You should see a result similar to the following:
Client: Docker Engine - Community
Version: 18.09.2
API version: 1.39
Go version: go1.10.8
Git commit: 6247962
Built: Sun Feb 10 04:12:39 2019
OS/Arch: darwin/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.2
API version: 1.39 (minimum version 1.12)
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 04:13:06 2019
OS/Arch: linux/amd64
Experimental: false
Tip: Use the Docker app to adjust the memory, swap, and CPUs allocated to Docker containers.
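Before moving on, it can help to confirm that every tool used in this guide is on your PATH. The following is a minimal sketch and not part of the repository; the helper name `require_cmd` is made up for illustration.

```shell
# Hypothetical helper (not part of this repo): succeed only if every
# named command can be found on the PATH.
require_cmd() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      return 1
    fi
  done
}

# Example use (not executed here):
#   require_cmd docker docker-compose git && docker version
```

The function stops at the first missing tool and reports its name on standard error.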
The repository contains the following files:
- The docker-compose.yml file includes the description of how to build and configure the container that will run Apache Drill.
- The .env file contains the parameter DRILL_VERSION that determines which version of Apache Drill is being built and run. Currently, DRILL_VERSION is set to 1.16.0. If you need to change this, adapt the .env file.
- The /build/Dockerfile contains the build descriptions for the container.
- The run_drill.sh contains the startup script for Apache Drill.
- The .gitignore prevents any file in the data folder and any parquet/csv files from being added to the repository.
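To check which Drill version the .env file currently selects, a small helper like the one below may be convenient. It is a sketch under the assumption that the file holds a plain `DRILL_VERSION=<value>` line without quotes or inline comments; the function name `drill_version` is hypothetical.

```shell
# Hypothetical helper (not part of this repo): print the value of
# DRILL_VERSION from an .env file (default: ./.env).
# Assumes a plain KEY=value line without quotes or inline comments.
drill_version() {
  sed -n 's/^DRILL_VERSION=//p' "${1:-.env}" | head -n 1
}

# Example use from the repository root (not executed here):
#   drill_version
```

After editing the value in .env, you will most likely need to rebuild the container for the change to take effect.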
Use GitHub Desktop or run the following commands (assuming that you have git installed):
- Check for git
git version
You should see a result similar to the following:
git version 2.20.1 (Apple Git-117)
- Check out the repo into a directory of your choice:
git clone https://github.com/mschermann/docker_apache_drill_datagrip.git
- Check the configuration:
docker-compose config
You should see a result similar to the following:
services:
  drill:
    build:
      args:
        DRILL_VERSION: 1.16.0
      context: /<YOUR PATH>/build
      dockerfile: Dockerfile
    command: ./run_drill.sh
    container_name: drill
    environment:
      DRILL_VERSION: 1.16.0
    hostname: drill
    ports:
    - 8047:8047/tcp
    - 31010:31010/tcp
    restart: on-failure
    tty: true
    volumes:
    - /<YOUR PATH>/data:/data:rw
- Build the container
docker-compose build
You should see Docker start building the container. This can take a while, depending on your internet speed and machine configuration.
Building drill
Step 1/13 : FROM centos:latest
...
- Start and stop the container
docker-compose up
You should see Docker start the container. When you see the Drill message of the day, Drill is up and running:
Attaching to e382a3c16e10_drill
e382a3c16e10_drill | Apache Drill 1.16.0
e382a3c16e10_drill | "There are two types of analysts in the world: those who use Drill and those who don't."
This step also extracts the JDBC driver for Apache Drill (e.g., drill-jdbc-all-1.16.0.jar) into the /build folder.
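If you start the container in the background (for example with `docker-compose up -d`), you will not see the message of the day. One way to wait for Drill is to poll until a probe command succeeds. The sketch below is a generic retry helper; the name `wait_until` is hypothetical, and treating a response from the web port as "ready" is an assumption.

```shell
# Hypothetical helper (not part of this repo): run a command repeatedly
# until it succeeds. Arguments: max attempts, delay in seconds, command.
# The body runs in a subshell, so its variables do not leak.
wait_until() (
  tries=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@"; then
      exit 0
    fi
    i=$((i + 1))
    # Pause only if another attempt follows.
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
  done
  exit 1
)

# Example use (not executed here): up to 30 attempts, 2 seconds apart.
#   docker-compose up -d
#   wait_until 30 2 curl -fsS -o /dev/null http://localhost:8047/
```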
You can stop the container with Control+C, which should result in the following output:
Gracefully stopping... (press Ctrl+C again to force)
Stopping e382a3c16e10_drill ... done
Drill serves a web GUI at http://localhost:8047.
If you click on Query, you can run SQL queries directly from the browser (do not use this for any heavy-load querying).
Make sure that everything works fine by entering the example query SELECT * FROM cp.`employee.json` LIMIT 20.
It will show you a waiting screen and, if everything works fine, the results.
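The same check can be scripted against Drill's REST API, which accepts SQL via `POST /query.json` on the web port. The following is a sketch, not part of the repository: the function names and the `DRILL_URL` variable are hypothetical, and the JSON escaping only covers backslashes and double quotes in a single-line query.

```shell
# Build the JSON body for Drill's REST endpoint POST /query.json.
# Backslashes and double quotes in the SQL are escaped; the sketch
# assumes a single-line query.
drill_query_payload() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"queryType":"SQL","query":"%s"}\n' "$escaped"
}

# Hypothetical wrapper: send the query to a Drill instance
# (default: the web port 8047 published by this container).
drill_query() {
  drill_query_payload "$1" |
    curl -sS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "${DRILL_URL:-http://localhost:8047}/query.json"
}

# Example use (not executed here):
#   drill_query 'SELECT * FROM cp.`employee.json` LIMIT 20'
```

Single quotes around the SQL keep the shell from interpreting the backticks.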
Now, head over to the Drill Documentation and start learning how to use Drill.
Let's connect DataGrip to the Drill container.
- Create a new driver in DataGrip by pointing it to the JDBC driver for Apache Drill in the /build folder.
- Create a data source using the Drill driver. Test the connection and make sure you get the green checkmark.
- Run a sample query
Using the same query as above (SELECT * FROM cp.`employee.json` LIMIT 20), you should see the query results in DataGrip.
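When creating the data source, you will need a JDBC URL. For a direct connection to a drillbit, Drill's JDBC driver (class `org.apache.drill.jdbc.Driver`) accepts URLs of the form `jdbc:drill:drillbit=<host>:<port>`. The helper below is a small sketch with a hypothetical name; its defaults assume you connect from the host machine to the port 31010 published by this container.

```shell
# Hypothetical helper: print a JDBC URL for a direct drillbit connection.
# Arguments: host (default localhost), port (default 31010).
drill_jdbc_url() {
  printf 'jdbc:drill:drillbit=%s:%s\n' "${1:-localhost}" "${2:-31010}"
}

# Example use (not executed here):
#   drill_jdbc_url
```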
At this point, you are all set. Add your data files to the /data folder, and you should be able to query them.
If you use parquet data files, the following query returns the first five rows of the data.
SELECT * FROM dfs.`/data` LIMIT 5;
Head over to the Drill documentation for a more in-depth explanation and help.
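To sample a specific file rather than the whole folder, the same query pattern can be generated for any path under /data. This is only a sketch: the function name is hypothetical, and the file name in the example (`trips.parquet`) is made up.

```shell
# Hypothetical helper: print a query that samples rows from a file or
# folder in the dfs storage plugin.
# Arguments: path inside the container, row count (default 5).
drill_sample_query() {
  printf 'SELECT * FROM dfs.`%s` LIMIT %d;\n' "$1" "${2:-5}"
}

# Example use (not executed here); trips.parquet is a made-up file name:
#   drill_sample_query /data/trips.parquet 10
```

The printed statement can be pasted into the web GUI or into DataGrip.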
You may want to access the Drill container at some point. The following steps show how to connect to the container.
- Find the name of your container.
docker ps
This should result in an output like this:
CONTAINER ID   IMAGE                                 COMMAND            CREATED             STATUS          PORTS                                              NAMES
e382a3c16e10   docker_drill_parquet_datagrip_drill   "./run_drill.sh"   About an hour ago   Up 47 seconds   0.0.0.0:8047->8047/tcp, 0.0.0.0:31010->31010/tcp   e382a3c16e10_drill
- Access the container
From the NAMES column, you can see that this container is called e382a3c16e10_drill. You can connect to this container using the following command:
docker exec -it e382a3c16e10_drill bash
This will result in a prompt inside the container:
[root@drill drill]#
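If you would rather not read the container name off the table by hand, the NAMES column can be extracted from the `docker ps` output. The helper below is a sketch with a hypothetical name; it relies on NAMES being the last column of the default output. Docker can also print names directly with `docker ps --format '{{.Names}}'`.

```shell
# Hypothetical helper: read `docker ps` output on stdin and print the
# NAMES column (the last field), skipping the header row.
container_names() {
  awk 'NR > 1 { print $NF }'
}

# Example use (not executed here):
#   docker ps | container_names
#   docker exec -it "$(docker ps | container_names | head -n 1)" bash
```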
- This repo was inspired by this post by Mattia Casotto.
- The Dockerfile is an adapted version from Apache's Drill Dockerfile