Universal Data is a lightweight ELT/ETL tool for API data extraction and transformation.
- Extract data from any API with a simple configuration
- Auto-detection of data types
- Transform data with simple Python functions
Python scripts + Postgres (the database that stores pipeline configuration and data)
ELT-T (Extract - Load - Transform - Transfer)
- Extract: Python script to extract data from an API (see the sketch after this list)
- Load: into a Postgres database spun up for it (source_{source_id})
- Transform: base Python script to infer data types and transform the data
- Transfer: (optional) Python script to transfer data to a target database
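To make the Extract step concrete, here is a minimal sketch, assuming a config dict with hypothetical base_url and endpoint keys; the real extractor lives in extract/ and is configured per source, so treat this as illustration only.

```python
import requests

# Hypothetical extractor; the config keys (base_url, endpoint) are
# illustrative assumptions, not the project's actual schema.
def extract(config: dict) -> list:
    """Fetch raw records from the API described by a source config."""
    response = requests.get(f"{config['base_url']}/{config['endpoint']}", timeout=30)
    response.raise_for_status()
    return response.json()

# Usage sketch with the Hacker News API:
# extract({"base_url": "https://hacker-news.firebaseio.com/v0",
#          "endpoint": "topstories.json"})
```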
extract/
    sources/         : the config files for sources
    parser.py        : API response parser (JSON, XML, ...)
    scraper.py       : all the scraping methods
    utils.py         : utility functions
load/
    base.py          : SQLAlchemy model to load data into the destination
transfer/
    transfer.py      : transfer data from source to destination
transform/
    transformations/ : post-processing transformations for specific sources
    model.py         : functions for modeling / normalization
database.py          : utilities for the database / data warehouse
run.py               : script to run a pipeline / task(s)
server.py            : server to manage all pipelines
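As an example of a post-processing transformation, a file under transform/transformations/ could look like the sketch below. The function name, signature, and field names are illustrative assumptions, not the project's actual API.

```python
from datetime import datetime, timezone

# Hypothetical per-source transformation for the hacker_news source.
def transform_hacker_news(row: dict) -> dict:
    """Normalize one raw Hacker News item before loading."""
    return {
        "id": int(row["id"]),
        "title": row.get("title", "").strip(),
        # Convert Unix epoch seconds to an ISO 8601 timestamp.
        "created_at": datetime.fromtimestamp(int(row["time"]), tz=timezone.utc).isoformat(),
    }
```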
- Install the requirements
pip install -r requirements.txt
- Modify the .env file (see the example after this list)
- Create the database with
python database.py
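A minimal .env could look like the sketch below; the variable name and URI are assumptions for illustration, so check what database.py actually reads.

```
# Hypothetical variable name; adjust to what database.py expects
DATABASE_URI=postgresql+psycopg2://user:password@localhost:5432/universal-data
```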
- Everything can be managed directly in the PostgreSQL database
-- Create a new client
INSERT INTO "public"."clients"("id", "name") VALUES(1, 'default');
-- Create a new source
INSERT INTO "public"."sources"("id", "name", "config", "client_id")
VALUES(1, 'hacker_news', NULL, 1);
-- Create a new target
INSERT INTO "public"."targets"("id", "type", "uri", "client_id")
VALUES(1, 'universal-load', 'postgresql+psycopg2://localhost:5432/universal-load', 1);
-- Create a new pipeline
INSERT INTO "public"."pipelines"("source_id", "target_id", "active") VALUES(1, 1, TRUE);
- Create a new config file in sources/ (a sketch follows)
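What such a file contains is source-specific; as a loose sketch, assuming a JSON format, a hacker_news config might describe the API to pull. Every key below is a guess for illustration, not the actual schema.

```json
{
  "name": "hacker_news",
  "base_url": "https://hacker-news.firebaseio.com/v0",
  "endpoint": "topstories.json",
  "format": "json"
}
```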
Run the server
python server.py
Run a pipeline
python run.py 1 --extract --transform --transfer
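The positional argument is the pipeline ID and the flags select which steps to run; for example, python run.py 1 --extract should run only the extract step for pipeline 1.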