This Docker Compose cluster contains Apache Atlas (and its dependencies) together with a minimal Hadoop ecosystem for some basic experiments. More precisely, it provides:
- Kafka and Zookeeper (Atlas also depends on them);
- HDFS 2.7;
- Hive 2.3.2;
- Spark 3.3.
- Run
docker-compose build
to build all Docker images, or run
docker-compose up
to start all cluster services (it may take some time).
- Wait for the following message:
atlas-server_1 | Apache Atlas Server started!!!
atlas-server_1 |
atlas-server_1 | waiting for atlas to be ready
atlas-server_1 | .....
atlas-server_1 | Server: Apache Atlas
or verify that the server is up and running with:
curl -u admin:admin http://localhost:21000/api/atlas/admin/version
- Access the Atlas server at http://localhost:21000 using user admin and password admin.
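If you prefer to script the readiness check instead of watching the container logs, a small sketch like the one below polls the same version endpoint used in the curl command above. It assumes the requests package is available on the host; host, port and credentials are the defaults from this setup.
# Minimal sketch: poll the Atlas version endpoint until the server answers.
import time
import requests

ATLAS_URL = "http://localhost:21000/api/atlas/admin/version"
AUTH = ("admin", "admin")

def wait_for_atlas(timeout=600, interval=10):
    """Return the version payload once Atlas responds, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(ATLAS_URL, auth=AUTH, timeout=5)
            if resp.status_code == 200:
                return resp.json()
        except requests.ConnectionError:
            pass  # Atlas not up yet, keep polling
        time.sleep(interval)
    raise TimeoutError("Atlas did not become ready in time")

if __name__ == "__main__":
    print(wait_for_atlas())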
Load data into Hive:
$ docker-compose exec hive-server bash
# /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000
> CREATE TABLE pokes (foo INT, bar STRING);
> LOAD DATA LOCAL INPATH '/opt/hive/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
or
create table brancha(full_name string, ssn string, location string);
create table branchb(full_name string, ssn string, location string);
insert into brancha(full_name,ssn,location) values ('ryan', '111-222-333', 'chicago');
insert into brancha(full_name,ssn,location) values ('brad', '444-555-666', 'minneapolis');
insert into brancha(full_name,ssn,location) values ('rupert', '000-000-000', 'chicago');
insert into brancha(full_name,ssn,location) values ('john', '555-111-555', 'boston');
insert into branchb(full_name,ssn,location) values ('jane', '666-777-888', 'dallas');
insert into branchb(full_name,ssn,location) values ('andrew', '999-999-999', 'tampa');
insert into branchb(full_name,ssn,location) values ('ryan', '111-222-333', 'chicago');
insert into branchb(full_name,ssn,location) values ('brad', '444-555-666', 'minneapolis');
create table branch_intersect as select b1.full_name,b1.ssn,b1.location from brancha b1 inner join branchb b2 ON b1.ssn = b2.ssn;
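The join above is the interesting part for Atlas: the Hive hook reports lineage from brancha/branchb to branch_intersect. The sketch below shows one way to fetch that lineage over the Atlas v2 REST API; the default database and the "@primary" cluster suffix in the qualified name are assumptions about this setup and may need adjusting.
# Minimal sketch: look up the hive_table entity for branch_intersect and
# fetch its lineage graph from the Atlas v2 REST API.
import requests

BASE = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

def get_table_guid(qualified_name):
    # Look the hive_table entity up by its unique qualifiedName attribute
    resp = requests.get(
        f"{BASE}/entity/uniqueAttribute/type/hive_table",
        params={"attr:qualifiedName": qualified_name},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()["entity"]["guid"]

def get_lineage(guid, direction="BOTH", depth=3):
    resp = requests.get(
        f"{BASE}/lineage/{guid}",
        params={"direction": direction, "depth": depth},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # "default.branch_intersect@primary" is an assumed qualified name
    guid = get_table_guid("default.branch_intersect@primary")
    lineage = get_lineage(guid)
    print(lineage["relations"])  # edges between brancha/branchb and branch_intersect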
To run some code in Spark:
$ docker-compose exec spark bash
# /usr/local/spark/bin/spark-submit /tmp/teste.py
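The contents of /tmp/teste.py are not shown here. As a purely hypothetical example, a job along the following lines reads one of the Hive tables created above and writes a derived table back to Hive; the table names and the Hive-enabled session are assumptions about this particular setup.
# Hypothetical example of a job that could be submitted with spark-submit:
# read a Hive table created above and persist a derived table back to Hive.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("atlas-lineage-test")
    .enableHiveSupport()
    .getOrCreate()
)

# Read one of the branch tables loaded through beeline above
brancha = spark.table("brancha")

# Derive a small result set and save it as a new Hive table
chicago = brancha.filter(brancha.location == "chicago")
chicago.write.mode("overwrite").saveAsTable("brancha_chicago")

spark.stop()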
- The whole project is quite large; for instance, the Apache Atlas image is about 1.0 GB because of the Atlas binaries and its dependencies (HBase and Solr).
- Startup may also take some time, depending on hardware resources.
- If you want, you may start only a few components:
docker-compose up atlas-server spark
- The Atlas server image is based on the Rokku project.
- The Hive and HDFS images are based on the bde2020 project.