# Dataset onboarding checklist

We follow a standard flow when onboarding a new dataset.

Create a GitHub issue and paste the following checklist into the issue description whenever a new dataset is requested. The structure is pre-formatted as a Markdown checklist.

## Preparation and exploratory analysis

*From `docs/datapull/all.dataset_onboarding_checklist.reference.md`*

- Decide on the timeline
  - E.g., is this a high-priority dataset or a nice-to-have?
- Decide on the course of action
  - E.g., do we download only historical bulk data, or do we also prepare a real-time downloader?
- Review existing code
  - Is there any downloader similar to the new one in terms of interface, frequency, etc.?
  - What existing code can be generalized to accomplish the task at hand?
  - What needs to be implemented from scratch?
- Create an exploratory notebook that includes:
  - A description of the data type, if this is the first time downloading that data type
  - Example code to obtain a snippet of historical/real-time data
  - If we are interested in historical data, e.g.:
    - How far in the past do we need the data to go?
    - How far in the past does the data source go?
- Create example code to obtain data in real time
  - Are there any issues with the real-time data?
    - E.g., throttling, API issues, unreliability
- Perform initial QA on the data sample (see the sketch after this list), e.g.:
  - Compute some statistics on missing data and outliers
  - Do the real-time and historical data match at first sight in terms of schema and content?
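
As a starting point for the exploratory notebook, the snippet download and initial QA might look like the minimal sketch below. It assumes a CCXT-style OHLCV source with 1-minute bars; the exchange, symbol, and outlier threshold are placeholders, not the actual vendor setup.

```python
import ccxt
import pandas as pd

# Fetch a small snippet of historical OHLCV data.
# Assumption: the source is reachable via CCXT; swap in the real vendor API.
exchange = ccxt.binance()
since = exchange.parse8601("2024-01-01T00:00:00Z")
raw = exchange.fetch_ohlcv("BTC/USDT", timeframe="1m", since=since, limit=500)
df = pd.DataFrame(raw, columns=["timestamp", "open", "high", "low", "close", "volume"])
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
df = df.set_index("timestamp")

# Initial QA: missing data and crude outlier statistics.
print("NaNs per column:\n", df.isna().sum())
# A 1-minute dataset should have one bar per minute; count the gaps.
expected = pd.date_range(df.index.min(), df.index.max(), freq="1min")
print("Missing bars:", len(expected.difference(df.index)))
# Flag returns more than 5 standard deviations from the mean as potential outliers.
returns = df["close"].pct_change()
print("Potential outliers:\n", returns[(returns - returns.mean()).abs() > 5 * returns.std()])
```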

## Implement historical downloader

- Decide on the name of the dataset according to the `dataset_schema` conventions
- Implement the code for the historical downloader (a minimal sketch follows this list)
  - TODO(Juraj): Add a pointer to examples and docs
- Test the flow by downloading a snippet of data locally in the test stage
  - Apply QA to confirm the data is being downloaded correctly
- Perform a bulk download for historical datasets
  - Manually, i.e., by executing a script, if the history is short or the volume of data is low
  - Via an Airflow DAG, if the volume of data is too large to download manually
    - E.g., `im_v2/airflow/dags/test.download_bulk_data_fargate_example_guide.py`
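
The overall shape of a bulk historical downloader is sketched below: page through the vendor's history and persist the result partitioned by date. This is illustrative only (hence the TODO above); the CCXT source, the paging step, and the S3 path are assumptions, not the repo's actual conventions.

```python
import ccxt
import pandas as pd

COLUMNS = ["timestamp", "open", "high", "low", "close", "volume"]


def download_bulk_ohlcv(symbol: str, start: str, end: str) -> pd.DataFrame:
    """Page through 1-minute OHLCV history between `start` and `end` (ISO 8601)."""
    exchange = ccxt.binance()
    since = exchange.parse8601(start)
    end_ms = exchange.parse8601(end)
    chunks = []
    while since < end_ms:
        bars = exchange.fetch_ohlcv(symbol, timeframe="1m", since=since, limit=1000)
        if not bars:
            break
        chunks.append(pd.DataFrame(bars, columns=COLUMNS))
        # Advance the cursor one bar past the last timestamp returned.
        since = bars[-1][0] + 60_000
    df = pd.concat(chunks, ignore_index=True)
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    return df


if __name__ == "__main__":
    df = download_bulk_ohlcv("BTC/USDT", "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z")
    # Partition by date so daily appends are cheap; the bucket and layout below
    # are placeholders, not the real dataset_schema naming.
    df["date"] = df["timestamp"].dt.date
    df.to_parquet("s3://test-bucket/ohlcv/", partition_cols=["date"])  # needs s3fs
```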

## Automated (a.k.a. scheduled) downloader

- Set up the automatic download of data in pre-production:
  - Since pre-prod runs code from the master branch (updated automatically twice a day), make sure to merge any PRs related to the dataset onboarding first
  - For historical datasets:
    - To provide a single S3 location for accessing the entire dataset, move the bulk history from the test bucket to the pre-prod bucket (the source and destination paths should be identical)
    - Add a daily Airflow task that downloads the previous day's data and appends it to the existing bulk dataset (see the sketch below)
  - For real-time datasets:
    - Add a real-time Airflow task that downloads data continuously, 24/7
- For some real-time datasets, an archival flow needs to be added in order not to overwhelm the storage
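
A daily append task could be wired up roughly as below. This is a minimal sketch, not one of the production DAGs under `im_v2/airflow/dags/`; the DAG id, schedule, and callable are placeholders.

```python
import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_previous_day(**context) -> None:
    # Placeholder: run the historical downloader for the previous day and
    # append the result to the existing bulk dataset in the pre-prod bucket.
    ...


with DAG(
    dag_id="preprod.daily_append_example",  # hypothetical DAG id
    schedule_interval="@daily",
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(
        task_id="download_previous_day",
        python_callable=download_previous_day,
    )
```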

- Add an entry into the
- Once the download is enabled in production, update the `Master_raw_data_gallery`

## Quality Assurance

### 1. Check for Existing QA DAGs

- Verify whether a similar QA DAG is already running.
  - Check for existing QA DAGs (e.g., bid_ask/OHLCV, cross-QA for OHLCV comparing real-time with historical data).
  - Action: If the new QA is just a change in the universe or vendor, append a new task to the existing running DAGs (see the sketch below). Reference: [Link to Relevant Section].
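
Existing QA DAGs typically generate one task per entry of a list, so appending a task for a new universe or vendor can be a one-line change. A hedged sketch of the pattern (the names are illustrative, not the actual DAG's):

```python
# Inside the existing QA DAG's `with DAG(...)` block:
from airflow.operators.python import PythonOperator


def run_qa_for_vendor(vendor: str) -> None:
    # Placeholder for the QA callable that already exists in the DAG file.
    ...


QA_VENDORS = ["existing_vendor", "new_vendor"]  # <- append the new vendor here

for vendor in QA_VENDORS:
    PythonOperator(
        task_id=f"qa_ohlcv.{vendor}",
        python_callable=run_qa_for_vendor,
        op_kwargs={"vendor": vendor},
    )
```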

### 2. Create a New QA DAG (if necessary)

#### 2.1. Create and Test QA Notebook

- Develop a notebook to test the QA process.
  - Test over a small period to ensure it functions as expected.
  - Tip: Use a small dataset or a limited time frame for quick testing.

#### 2.2. Run QA Notebook via Invoke Command
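
The exact invoke target is repo-specific, so the following is only a sketch of what such a wrapper could look like, using `jupyter nbconvert` to execute the notebook headlessly; the task and parameter names are assumptions.

```python
# tasks.py -- hypothetical invoke wrapper, not the repo's actual task.
from invoke import task


@task
def run_qa_notebook(c, notebook, out_dir="qa_results"):
    """Execute a QA notebook headlessly and keep the executed copy."""
    c.run(f"mkdir -p {out_dir}")
    c.run(f"jupyter nbconvert --to notebook --execute {notebook} --output-dir {out_dir}")
```

It would then be run as, e.g., `invoke run-qa-notebook --notebook <path/to/qa_notebook.ipynb>`.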

#### 2.3. Create a New DAG File

- Create a new DAG file once the QA process has been validated.

Last review: GP on 2024-04-20