We follow a standard flow when onboarding a new dataset.
Whenever a request for a new dataset is made, create a GH issue and paste the following checklist into the issue description. The checklist is pre-formatted as markdown.
From docs/datapull/all.dataset_onboarding_checklist.reference.md
- Decide on the timeline
  - E.g., is this a high-priority dataset or a nice-to-have?
- Decide on the course of action
  - E.g., do we download only historical bulk data and/or also prepare a real-time downloader?
- Review existing code
  - Is there any downloader that is similar to the new one in terms of interface, frequency, etc.?
  - What existing code can be generalized to accomplish the task at hand?
  - What needs to be implemented from scratch?
- Create an exploratory notebook that includes:
  - A description of the data type, if this is the first time we download this data type
  - Example code to obtain a snippet of historical/real-time data (a minimal sketch is shown after this item)
  - If we are interested in historical data, e.g.,
    - How far in the past do we need the data to go?
    - How far in the past does the data source go?
  - Example code to obtain data in real-time
    - Is there any issue with the real-time data?
      - E.g., throttling, API issues, unreliability
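The checklist does not prescribe any particular API; purely as an illustration, below is a minimal sketch of the snippet-fetching code such a notebook might contain, assuming the third-party `ccxt` library and an arbitrary exchange, symbol, and timeframe.

```python
# Minimal sketch: fetch a small snippet of historical OHLCV data for exploration.
# The exchange, symbol, timeframe, and start date are illustrative choices only.
import ccxt
import pandas as pd

exchange = ccxt.binance()
# Start timestamp in milliseconds since epoch.
since = exchange.parse8601("2024-01-01T00:00:00Z")
raw = exchange.fetch_ohlcv("BTC/USDT", timeframe="1m", since=since, limit=100)
df = pd.DataFrame(raw, columns=["timestamp", "open", "high", "low", "close", "volume"])
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
print(df.head())
```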
- Perform initial QA on the data sample, e.g.,
  - Compute some statistics in terms of missing data and outliers (see the QA sketch below)
  - Do real-time and historical data match at first sight in terms of schema and content?
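A minimal sketch of such first-pass QA, assuming the OHLCV snippet from the sketch above is loaded in a pandas DataFrame `df`; the 5-sigma outlier threshold is an arbitrary illustrative choice.

```python
# Minimal sketch: first-pass QA statistics on a sample DataFrame `df`.
import numpy as np

# Fraction of missing values per column.
print(df.isna().mean())
# Distribution of gaps in the time index (for 1-minute bars, anything other
# than 1 minute indicates missing rows).
print(df["timestamp"].diff().value_counts().head())
# Crude outlier check: flag closes more than 5 standard deviations from the mean.
z_score = (df["close"] - df["close"].mean()) / df["close"].std()
print(df[np.abs(z_score) > 5])
```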
- Decide on the name of the dataset according to the dataset_schema conventions
- Implement the historical downloader
  - TODO(Juraj): Add a pointer to examples and docs
- Test the flow by downloading a snippet of data locally in the test stage
- Apply QA to confirm the data is being downloaded correctly
- Perform a bulk download for historical datasets
  - Manually, i.e., by executing a script, if the history is short or the data volume is low (see the pagination sketch below)
  - Via an Airflow DAG if the data volume is too large to download manually
    - E.g., im_v2/airflow/dags/test.download_bulk_data_fargate_example_guide.py
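For the manual path, a minimal sketch of a paginated bulk download, again assuming `ccxt` and a hypothetical local Parquet destination; the real flow goes through the repo's downloader scripts and S3, which are not shown here.

```python
# Minimal sketch: paginate over the requested history in chunks and accumulate the results.
import ccxt
import pandas as pd

exchange = ccxt.binance()
start_ms = exchange.parse8601("2023-01-01T00:00:00Z")
end_ms = exchange.parse8601("2023-01-02T00:00:00Z")
chunks = []
since = start_ms
while since < end_ms:
    bars = exchange.fetch_ohlcv("BTC/USDT", timeframe="1m", since=since, limit=1000)
    if not bars:
        break
    chunks.append(
        pd.DataFrame(bars, columns=["timestamp", "open", "high", "low", "close", "volume"])
    )
    # Advance past the last returned bar to avoid refetching it.
    since = bars[-1][0] + 1
df = pd.concat(chunks, ignore_index=True)
# Hypothetical local destination; the production flow writes to S3 instead.
df.to_parquet("btc_usdt_1m_20230101.parquet")
```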
- Set up the automatic download of data in pre-production:
  - Since pre-prod runs with code from the master branch (updated automatically twice a day), make sure to merge any PRs related to the dataset onboarding first
  - For historical datasets:
    - To provide a single S3 location for accessing the entire dataset, move the bulk history from the test bucket to the pre-prod bucket (the source and destination paths should be identical)
    - Add a daily download Airflow task that gets the previous day's data and appends it to the existing bulk dataset (a generic DAG sketch is shown below)
  - For real-time datasets:
    - Add a real-time download Airflow task that gets data continuously 24/7
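A generic sketch of the daily append task, using a plain BashOperator and a hypothetical download script; the production DAGs under im_v2/airflow/dags are built from the repo's own templates and operators instead.

```python
# Generic sketch: schedule a daily task that downloads the previous day's data.
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="preprod.daily_bulk_append_sketch",  # hypothetical DAG name
    schedule_interval="@daily",
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
) as dag:
    download_previous_day = BashOperator(
        task_id="download_previous_day",
        # Hypothetical script name, flags, and bucket, shown only to illustrate the task shape;
        # `{{ ds }}` is the logical date, i.e., the day whose data is being downloaded.
        bash_command="python download_bulk_data.py --date {{ ds }} --dst s3://<preprod-bucket>/<dataset>",
    )
```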
- For some real-time datasets, an archival flow needs to be added in order not to overwhelm the storage
  - Consult with the team leader on whether it is needed for a particular dataset
  - An example Airflow DAG is preprod.europe.postgres_data_archival_to_s3.py (a generic sketch of the idea is shown below)
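The archival flow itself is repo-specific; purely to illustrate the idea, below is a generic sketch that moves rows older than a cutoff from Postgres to Parquet on S3 and then deletes them, assuming SQLAlchemy, pandas, and s3fs, with a hypothetical table, DSN, bucket, and retention period.

```python
# Generic sketch of a Postgres-to-S3 archival step; table, DSN, bucket, and cutoff are hypothetical.
import datetime

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@host:5432/dbname")  # placeholder DSN
cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=30)

with engine.begin() as conn:
    # Pull the rows to archive.
    df = pd.read_sql(
        sa.text("SELECT * FROM bid_ask WHERE timestamp < :cutoff"),
        conn,
        params={"cutoff": cutoff},
    )
    if not df.empty:
        # Write them to S3 as Parquet (requires s3fs/pyarrow), then drop them from Postgres.
        df.to_parquet(f"s3://<archive-bucket>/bid_ask/{cutoff:%Y%m%d}.parquet", index=False)
        conn.execute(
            sa.text("DELETE FROM bid_ask WHERE timestamp < :cutoff"),
            {"cutoff": cutoff},
        )
```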
- Add an entry into the
- Once the download is enabled in production, update the Master_raw_data_gallery
- Verify whether a similar QA DAG is already running
  - Check for existing QA DAGs (e.g., bid_ask/OHLCV QA, cross QA for OHLCV comparing real-time with historical data)
  - Action: if the new QA is just a change in the universe or vendor, append a new task to the existing running DAGs. Reference: [Link to Relevant Section]
- Develop a notebook to test the QA process
  - Test over a small period to ensure it functions as expected
  - Tip: use a small dataset or a limited time frame for quick testing
- Execute the QA notebook using the invoke command to validate its functionality
  - Example: Invoke Command Example
- Create a new DAG file after the QA process is validated
  - Follow the standard procedure for DAG creation. Reference: DAG Creation Tutorial
Last review: GP on 2024-04-20