HLoader is a tool built around Apache Sqoop and Apache Oozie for ingesting data from relational databases into Hadoop.
- Clone the repository (git clone https://github.com/cerndb/hloader.git)
- Install the requirements (pip install -r requirements.txt) and the Oracle .rpm files located in /travis-resources
- Set up the config.ini (properties starting with AUTH_ represent the authentication database, while properties starting with POSTGRE_ represent the meta database)
- Run HLoader.py
If you would like to set up the meta database yourself, the schema script is located in /hloader/db/PostgreSQL_backend_schema.sql
The REST API exposes metadata to the user and enables the submission of new ingestion jobs.
It runs on http://127.0.0.1:5000
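As a minimal sketch of how the API can be called (assuming the service is running locally on the address above and the requests library is available), the index page can be queried from Python:

```python
import requests

# Base URL of a locally running HLoader instance (default address from above).
BASE_URL = "http://127.0.0.1:5000"

# Query the API index page to check that the service is up.
response = requests.get(BASE_URL + "/api/v1")
print(response.status_code)
print(response.text)
```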
GET /headers
returns the Python environment variables and the request headers
GET /api/v1
returns the index page
GET /api/v1/clusters
returns a JSON object with an array of clusters, optionally filtered by an attribute value
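A hedged sketch of querying this endpoint from Python; the filter attribute cluster_name and its value are placeholders, not confirmed field names:

```python
import requests

BASE_URL = "http://127.0.0.1:5000"

# Fetch all registered clusters.
clusters = requests.get(BASE_URL + "/api/v1/clusters").json()

# Hypothetical filtered query: the attribute name "cluster_name" is an
# assumption; any attribute exposed in the cluster metadata could be used.
filtered = requests.get(BASE_URL + "/api/v1/clusters",
                        params={"cluster_name": "example_cluster"}).json()
```

The other GET endpoints that accept attribute filters (servers, jobs, logs, transfers) can be queried in the same way.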
GET /api/v1/servers
returns a JSON object with an array of servers, optionally filtered by an attribute value
GET /api/v1/schemas
returns a JSON object with arrays of the available and unavailable schemas for a given owner username. Required parameter: owner_username
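A short sketch using the documented owner_username parameter (the username value is a placeholder):

```python
import requests

BASE_URL = "http://127.0.0.1:5000"

# owner_username is the documented required parameter; "jdoe" is a placeholder.
schemas = requests.get(BASE_URL + "/api/v1/schemas",
                       params={"owner_username": "jdoe"}).json()
print(schemas)
```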
GET /api/v1/jobs
returns a JSON object with an array of jobs, optionally filtered by an attribute value
POST /api/v1/jobs
submits a job and returns a JSON object containing its ID.
Required parameters: source_server_id, source_schema_name, source_object_name, destination_cluster_id, destination_path, owner_username, workflow_suffix, sqoop_direct
Optional parameters: coordinator_suffix, sqoop_nmap, sqoop_splitting_column, sqoop_incremental_method, start_time, end_time, interval, job_last_update
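A hedged sketch of a job submission with the required parameters; every value is a placeholder, and the form-encoded body is an assumption (the section does not state whether the endpoint expects form or JSON encoding):

```python
import requests

BASE_URL = "http://127.0.0.1:5000"

# Required parameters from the list above; all values are placeholders and
# must be replaced with identifiers from your own metadata.
job = {
    "source_server_id": 1,
    "source_schema_name": "MY_SCHEMA",
    "source_object_name": "MY_TABLE",
    "destination_cluster_id": 1,
    "destination_path": "/user/jdoe/ingest/my_table",
    "owner_username": "jdoe",
    "workflow_suffix": "workflows/sqoop_workflow",
    "sqoop_direct": 1,
}

# Assumption: parameters are sent as a form-encoded request body.
response = requests.post(BASE_URL + "/api/v1/jobs", data=job)
print(response.json())  # expected to contain the ID of the new job
```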
DELETE /api/v1/jobs
deletes a job given its ID and reports the status of the operation. Required parameter: job_id
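A short sketch of deleting a job; passing job_id as a query-string parameter is an assumption about how the endpoint reads it:

```python
import requests

BASE_URL = "http://127.0.0.1:5000"

# job_id is the documented required parameter; 42 is a placeholder.
response = requests.delete(BASE_URL + "/api/v1/jobs", params={"job_id": 42})
print(response.json())  # reports the status of the delete operation
```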
GET /api/v1/logs
returns a JSON object with an array of logs, optionally filtered by an attribute value
GET /api/v1/transfers
returns a JSON object with an array of transfers, optionally filtered by an attribute value
Currently, submitted jobs are executed using Oozie Workflows or Coordinators. The path to the Workflow or Coordinator application on HDFS should be provided in the workflow_suffix or coordinator_suffix parameter, respectively. The URL of the Oozie deployment is contained in the cluster metadata.
Sample metadata insert statements can be found in /hloader/db/PostgreSQL_test_data.sql