- KaizenFlow workflow explanation
- Work organization
- Set-up
- Quant workflows
- Quant dev workflows
- TradingOps workflows
- MLOps workflows
- DevOps workflows
This document is a roadmap of most activities that Quants, Quant devs, and
DevOps can perform using KaizenFlow.

For each activity we point to the relevant resources (e.g., documents in
`docs`, notebooks) in the repo.

A high-level description of KaizenFlow is in the KaizenFlow White Paper.
- Issues workflow explained
  - amp/docs/work_organization/ck.issue_workflow.explanation.md
- GitHub and ZenHub workflows explained
  - /docs/work_organization/all.use_github_and_zenhub.how_to_guide.md
- TODO(Grisha): add more from /docs/work_organization/
- TODO(gp): Add pointers to the docs we ask to read during the on-boarding
- The dir `docs/documentation_meta` contains documents about writing the documentation
- Conventions and suggestions on how to create diagrams in the documentation
- A summary of how to create how-to guides, tutorials, explanations, and reference pages according to the Diataxis framework
- Writing documentation in Google Docs
- Writing documentation in Markdown
- Plotting in LaTeX
The life of a Quant is spent between:
- Exploring the raw data
- Computing features
- Building models to predict output given features
- Assessing models

These activities are mapped in KaizenFlow as follows:

- Exploring the raw data
  - This is performed by reading data using `DataPull` in a notebook and performing exploratory analysis
- Computing features
  - This is performed by reading data using `DataPull` in a notebook and creating some `DataFlow` nodes
- Building models to predict output given features
  - This is performed by connecting `DataFlow` nodes into a `Dag`
- Assessing models
  - This is performed by running data through a `Dag` in a notebook or in a Python script and post-processing the results in an analysis notebook
- Comparing models
  - The parameters of a model are exposed through a `Config` and then swept over `Config` lists
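To make the mapping above concrete, here is a self-contained sketch of the same loop in plain numpy/pandas; the actual `DataPull`, `DataFlow`, `Dag`, and `Config` classes are only referenced conceptually, not used.

```python
import numpy as np
import pandas as pd

# 1) "Explore the raw data": fabricate a price series instead of reading it
#    through `DataPull`.
idx = pd.date_range("2024-01-01", periods=1000, freq="1min", tz="UTC")
price = pd.Series(100 + np.random.randn(len(idx)).cumsum() * 0.1, index=idx)

# 2) "Compute features": a rolling z-score of returns, i.e., what a
#    `DataFlow` node would compute from its input dataframe.
ret = price.pct_change()
feature = (ret - ret.rolling(30).mean()) / ret.rolling(30).std()

# 3) "Build a model to predict output given features": predict the next-bar
#    return from the feature.
target = ret.shift(-1)
df = pd.DataFrame({"feature": feature, "target": target}).dropna()

# 4) "Assess and compare models": sweep a parameter, as one would sweep over
#    `Config` lists, and score each variant.
for smoothing in (1, 5, 15):
    pred = df["feature"].rolling(smoothing).mean()
    hit_rate = (np.sign(pred) == np.sign(df["target"])).mean()
    print(f"smoothing={smoothing}: hit_rate={hit_rate:.3f}")
```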
- General intro to `DataPull`
- Universe explanation
- Analyze universe metadata
- Organize and label datasets
  - Helps to uniquely identify datasets across different sources, types, attributes, etc.
  - /docs/datapull/all.data_schema.explanation.md
  - /docs/datapull/ck.handle_datasets.how_to_guide.md
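For illustration, a tiny helper that splits a dot-separated dataset signature into named attributes; the attribute names and their order below are assumptions made for this sketch, while the authoritative schema is defined in the data schema explanation doc listed above.

```python
# Illustrative only: the attribute names and ordering are assumed, not the
# official schema (see /docs/datapull/all.data_schema.explanation.md).
ATTRIBUTES = [
    "download_mode",
    "downloading_entity",
    "action_tag",
    "data_type",
    "asset_type",
    "universe",
    "vendor",
    "exchange",
    "version",
]


def parse_dataset_signature(signature: str) -> dict:
    """Map each dot-separated token of `signature` to a schema attribute."""
    tokens = signature.split(".")
    if len(tokens) != len(ATTRIBUTES):
        raise ValueError(f"Expected {len(ATTRIBUTES)} tokens, got {len(tokens)}")
    return dict(zip(ATTRIBUTES, tokens))


sig = "periodic_daily.airflow.downloaded_1min.ohlcv.futures.all.ccxt.binance.v1_0_0"
print(parse_dataset_signature(sig))
```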
- Inspect RawData
- Convert data types
- Data download pipelines explanation
- Download data in bulk
  - /im_v2/common/data/extract/download_bulk.py
  - /im_v2/ccxt/data/extract/download_exchange_data_to_db.py
  - TODO(Juraj): technically this could be joined into one script and also generalized for more sources
- Download data in real time over a given time interval
- Archive data
  - Helps to optimize data storage performance/costs by transferring older data from a storage like Postgres to S3
  - Suitable for high-frequency, high-volume real-time order book data
  - /im_v2/ccxt/db/archive_db_data_to_s3.py
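A minimal sketch of the archiving idea above, assuming pandas, SQLAlchemy, and an S3-backed Parquet target; the table name, columns, connection string, and bucket are hypothetical, and the production flow is the script listed above.

```python
import pandas as pd
import sqlalchemy

CUTOFF = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=30)
engine = sqlalchemy.create_engine("postgresql://user:pwd@host:5432/db")

with engine.begin() as conn:
    # Pull the stale slice of the high-volume table.
    old_rows = pd.read_sql(
        sqlalchemy.text("SELECT * FROM ccxt_bid_ask WHERE timestamp < :cutoff"),
        conn,
        params={"cutoff": CUTOFF},
    )
    if not old_rows.empty:
        # Write the slice to cheap columnar storage on S3 (requires s3fs)...
        old_rows.to_parquet(
            f"s3://archive-bucket/ccxt_bid_ask/{CUTOFF:%Y%m%d}.parquet"
        )
        # ...and only then free the space in Postgres.
        conn.execute(
            sqlalchemy.text("DELETE FROM ccxt_bid_ask WHERE timestamp < :cutoff"),
            {"cutoff": CUTOFF},
        )
```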
- Resampling data
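As an illustration of the resampling item above, a short pandas sketch that aggregates 1-second prices into 1-minute OHLCV bars; the column names and the end-of-interval labeling are assumptions, so check the time-series conventions doc for the actual conventions.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=600, freq="1s", tz="UTC")
df = pd.DataFrame(
    {
        "price": 100 + np.random.randn(len(idx)).cumsum() * 0.01,
        "volume": np.random.randint(1, 10, len(idx)),
    },
    index=idx,
)

# Aggregate 1-second observations into 1-minute OHLCV bars, labeling each bar
# by the end of its interval (assumed convention).
ohlcv = df["price"].resample("1min", label="right", closed="right").ohlc()
ohlcv["volume"] = df["volume"].resample("1min", label="right", closed="right").sum()
print(ohlcv.head())
```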
- ImClient
- MarketData
- How to QA data
  - /docs/datapull/ck.datapull_data_quality_assurance.reference.md
  - /im_v2/ccxt/data/qa/notebooks/data_qa_bid_ask.ipynb
  - /im_v2/ccxt/data/qa/notebooks/data_qa_ohlcv.ipynb
  - /im_v2/common/data/qa/notebooks/cross_dataset_qa_ohlcv.ipynb
  - /im_v2/common/data/qa/notebooks/cross_dataset_qa_bid_ask.ipynb
  - /research_amp/cc/notebooks/Master_single_vendor_qa.ipynb
  - /research_amp/cc/notebooks/Master_cross_vendor_qa.ipynb
  - /research_amp/cc/notebooks/compare_qa.periodic.airflow.downloaded_websocket_EOD.all.bid_ask.futures.all.ccxt_cryptochassis.all.v1_0_0.ipynb
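For orientation, a sketch of the kind of checks the QA notebooks above perform on OHLCV data (missing bars, NaNs, inconsistent OHLC values); the column names are assumed.

```python
import pandas as pd


def qa_ohlcv(df: pd.DataFrame, freq: str = "1min") -> dict:
    """Return a few simple data-quality metrics for an OHLCV dataframe."""
    # Gaps: bars that should exist on a regular grid but are missing.
    expected = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    missing_bars = expected.difference(df.index)
    # Internal consistency: high must bound open/close/low from above, and
    # low must bound them from below.
    bad_ohlc = (
        (df["high"] < df[["open", "close", "low"]].max(axis=1))
        | (df["low"] > df[["open", "close", "high"]].min(axis=1))
    )
    return {
        "n_rows": len(df),
        "pct_missing_bars": len(missing_bars) / max(len(expected), 1),
        "pct_nans": float(df.isna().mean().mean()),
        "n_bad_ohlc_rows": int(bad_ohlc.sum()),
        "n_nonpositive_volume": int((df["volume"] <= 0).sum()),
    }
```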
- How to load Bloomberg data
  - /im_v2/common/notebooks/CmTask5424_market_data.ipynb
  - TODO: Generalize the name and make it Master_
- Kibot guide
- Interactive Brokers guide
- How to run IM app
  - /docs/datapull/ck.run_im_app.how_to_guide.md
- TODO(gp): Reorg
  - /research_amp/cc/notebooks/Master_single_vendor_qa.ipynb
  - /research_amp/cc/notebooks/Master_model_performance_analyser.old.ipynb
  - /research_amp/cc/notebooks/Master_machine_learning.ipynb
  - /research_amp/cc/notebooks/Master_cross_vendor_qa.ipynb
  - /research_amp/cc/notebooks/Master_model_performance_analyser.ipynb
  - /research_amp/cc/notebooks/Master_crypto_analysis.ipynb
  - /research_amp/cc/notebooks/Master_model_prediction_analyzer.ipynb
  - /research_amp/cc/notebooks/Master_Analysis_CrossSectionalLearning.ipynb
  - /im/app/notebooks/Master_IM_DB.ipynb
  - /im/ib/metadata/extract/notebooks/Master_analyze_ib_metadata_crawler.ipynb
- Best practices for Quant research
  - /docs/dataflow/ck.research_methodology.explanation.md
  - TODO(Grisha): `ck.*` -> `all.*`?
- A list of all the available generic notebooks, each with a short description
  - /docs/dataflow/ck.master_notebooks.reference.md
  - TODO(Grisha): does this belong to `DataFlow`?
  - TODO(Grisha): `ck.master_notebooks...` -> `all.master_notebooks`?
- General concepts of `DataFlow`
  - Introduction to KaizenFlow, DAG nodes, DataFrame as unit of computation, DAG execution
  - DataFlow data format
  - Different views of System components, Architecture
  - Conventions for representing time series
  - Explanation of how to debug a DAG
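The "DataFrame as unit of computation" idea can be sketched in plain Python: each node consumes a dataframe and emits a dataframe, so nodes compose naturally. This is a conceptual sketch, not the actual KaizenFlow node API.

```python
import numpy as np
import pandas as pd


def resample_node(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate 1-second prices into 1-minute last prices."""
    return df.resample("1min", label="right", closed="right").last()


def zscore_node(df: pd.DataFrame, window: int = 30) -> pd.DataFrame:
    """Add a rolling z-score column for each input column."""
    out = df.copy()
    for col in df.columns:
        mean = df[col].rolling(window).mean()
        std = df[col].rolling(window).std()
        out[f"{col}.zscore"] = (df[col] - mean) / std
    return out


idx = pd.date_range("2024-01-01", periods=3600, freq="1s", tz="UTC")
raw = pd.DataFrame({"price": 100 + np.random.randn(len(idx)).cumsum()}, index=idx)
# In this sketch, executing the "DAG" is just function composition.
result = zscore_node(resample_node(raw))
print(result.tail())
```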
- Learn how to build a `DAG`
  - Build a `DAG` with two nodes
  - Build a more complex `DAG` implementing a simple risk model
  - Best practices to follow while building a `DAG`
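As a conceptual warm-up for the tutorials above, a minimal two-node "DAG" in plain Python: nodes are functions, edges declare dependencies, and running the DAG is a topological walk. The actual `DagBuilder`/`Dag` API differs; this only shows the idea.

```python
import numpy as np
import pandas as pd


def load_prices() -> pd.DataFrame:
    idx = pd.date_range("2024-01-01", periods=100, freq="1min", tz="UTC")
    return pd.DataFrame({"close": 100 + np.random.randn(len(idx)).cumsum()}, index=idx)


def compute_returns(df: pd.DataFrame) -> pd.DataFrame:
    return df.pct_change().rename(columns={"close": "ret"})


# node name -> (function, list of upstream node names)
dag = {
    "load_prices": (load_prices, []),
    "compute_returns": (compute_returns, ["load_prices"]),
}


def run(dag: dict) -> dict:
    """Execute nodes in dependency order and return all node outputs."""
    results: dict = {}
    while len(results) < len(dag):
        for name, (func, deps) in dag.items():
            if name not in results and all(d in results for d in deps):
                results[name] = func(*(results[d] for d in deps))
    return results


outputs = run(dag)
print(outputs["compute_returns"].head())
```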
- Learn how to run a `DAG`
  - Overview, DagBuilder, Dag, DagRunner
  - Configure a simple risk model, build a DAG, generate data and connect a data source to the DAG, run the DAG
  - Build a DAG from a Mock2 DagBuilder and run it
- General intro about model simulation
  - Property of tilability, batch vs streaming
  - Time semantics, how the clock is handled, flows
  - Phases of evaluation of `Dag`s
  - Event study explanation
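The tilability idea mentioned above can be illustrated with a small helper that splits a simulation period into independent tiles, which can then be computed in batch (all tiles) or incrementally (one tile at a time); the helper is illustrative, not repo code.

```python
import pandas as pd


def compute_tiles(start: str, end: str, freq: str = "MS") -> list:
    """Split [start, end) into contiguous tiles of length `freq`."""
    edges = pd.date_range(start, end, freq=freq, tz="UTC")
    if edges[0] > pd.Timestamp(start, tz="UTC"):
        edges = edges.insert(0, pd.Timestamp(start, tz="UTC"))
    if edges[-1] < pd.Timestamp(end, tz="UTC"):
        edges = edges.append(pd.DatetimeIndex([pd.Timestamp(end, tz="UTC")]))
    return list(zip(edges[:-1], edges[1:]))


for tile_start, tile_end in compute_tiles("2024-01-15", "2024-04-01"):
    # In a real backtest each tile would be fed through the DAG and the
    # result written to a per-tile Parquet file.
    print(f"run DAG on [{tile_start}, {tile_end})")
```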
- Run a simulation of a `DataFlow` system
  - Overview, basic concepts, implementation details
  - How to build a system, run research backtesting, process the results of backtesting, run a replayed-time simulation, run experiments
  - Simulation output explanation
- Run a simulation sweep using a list of `Config` parameters
  - /docs/dataflow/ck.run_backtest.how_to_guide.md
  - TODO(gp): @grisha do we have anything here? It's like the stuff that Dan does
  - TODO(Grisha): @Dan, add a link to the doc here once it is ready
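A minimal sketch of a config sweep, using nested dicts as stand-ins for `Config` objects and a placeholder backtest driver; all names and parameters are hypothetical.

```python
import copy
import itertools

template = {
    "dag": {"zscore_window": 30, "smoothing": 5},
    "backtest": {"start": "2024-01-01", "end": "2024-03-01"},
}

# Parameters to sweep and the values to try for each.
sweep = {
    ("dag", "zscore_window"): [15, 30, 60],
    ("dag", "smoothing"): [1, 5],
}

# Build one config per combination of swept values.
configs = []
for values in itertools.product(*sweep.values()):
    cfg = copy.deepcopy(template)
    for (section, key), value in zip(sweep.keys(), values):
        cfg[section][key] = value
    configs.append(cfg)


def run_backtest(cfg: dict) -> float:
    # Placeholder: a real driver would build the DAG from `cfg` and return a
    # performance statistic such as the Sharpe ratio.
    return 0.0


results = {str(cfg["dag"]): run_backtest(cfg) for cfg in configs}
print(f"ran {len(configs)} configs")
```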
- Post-process the results of a simulation
  - Build the Config dict, load tile results, compute portfolio bar metrics, compute aggregate portfolio stats
  - /dataflow/model/notebooks/Master_research_backtest_analyzer.ipynb
  - TODO(Grisha): is showcasing an example with fake data enough? We could use Mock2 output
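A pandas sketch of the post-processing flow: load per-tile Parquet results, stitch them together, and compute per-bar and aggregate portfolio stats. The paths and column names (`pnl`, `gmv`) are assumptions; the real analysis lives in the notebook above.

```python
import glob

import numpy as np
import pandas as pd

# 1) Load tile results (one Parquet file per time tile).
tiles = [pd.read_parquet(path) for path in sorted(glob.glob("tiles/*.parquet"))]
df = pd.concat(tiles).sort_index()

# 2) Per-bar portfolio metrics (assuming a `pnl` column in dollars and a
#    `gmv` column with gross market value).
bar_metrics = pd.DataFrame(
    {
        "pnl": df["pnl"],
        "ret": df["pnl"] / df["gmv"],
    }
)

# 3) Aggregate portfolio stats.
bars_per_year = 252 * 24 * 60  # approximate count of 1-minute bars per year
sharpe = bar_metrics["ret"].mean() / bar_metrics["ret"].std() * np.sqrt(bars_per_year)
print(f"total PnL={bar_metrics['pnl'].sum():.2f}, annualized Sharpe={sharpe:.2f}")
```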
- Analyze a `DataFlow` model in detail
  - Build Config, initialize ModelEvaluator and ModelPlotter
  - /dataflow/model/notebooks/Master_model_analyzer.ipynb
  - TODO(gp): @grisha what is the difference with the other?
  - TODO(Grisha): ask Paul about the notebook
- Analyze features computed with `DataFlow`
  - Read features from a Parquet file and perform some analysis
  - TODO(gp): Grisha do we have a notebook that reads data from ImClient/MarketData and performs some analysis?
  - TODO(Grisha): create a tutorial notebook for analyzing features using some real (or close to real) data
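A small pandas sketch of the kind of feature analysis meant above: read a feature Parquet file and look at coverage and rank correlation with a forward return; the file path and column names are hypothetical.

```python
import pandas as pd

df = pd.read_parquet("features.parquet")

# Coverage: fraction of non-NaN values per feature column.
feature_cols = [c for c in df.columns if c.startswith("feature.")]
print(df[feature_cols].notna().mean())

# Predictive power: rank correlation of each feature with the next-bar return.
target = df["ret"].shift(-1)
for col in feature_cols:
    ic = df[col].corr(target, method="spearman")
    print(f"{col}: IC={ic:.4f}")
```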
- Mix multiple `DataFlow` models
  - /dataflow/model/notebooks/Master_model_mixer.ipynb
  - TODO(gp): add more comments
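The model-mixing idea in a few lines of pandas: standardize each model's forecast and combine them with weights. The column names and weights are made up for the sketch; the real flow is the notebook above.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=100, freq="1min", tz="UTC")
preds = pd.DataFrame(
    {
        "model_a": np.random.randn(100) * 0.001,
        "model_b": np.random.randn(100) * 0.001,
    },
    index=idx,
)
weights = {"model_a": 0.6, "model_b": 0.4}

# Normalize each model's forecast to unit variance before mixing so that no
# model dominates only because of its scale.
standardized = preds / preds.std()
mixed = sum(weights[col] * standardized[col] for col in preds.columns)
print(mixed.head())
```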
- Exporting PnL and trades
- Learn how to build a `System`
  - TODO(gp): @grisha what do we have for this?
  - TODO(Grisha): add a tutorial notebook that builds a System and explains the flow step-by-step
- Configure a full system using a `Config`
  - Fill the `SystemConfig`, build all the components, and run the `System`
  - /docs/dataflow/system/all.use_system_config.tutorial.ipynb
- Create an ETL batch process using a `System`
  - /dataflow_amp/system/risk_model_estimation/run_rme_historical_simulation.py
  - TODO(Grisha): add an explanation doc and consider converting into a Jupyter notebook
- Create an ETL real-time process
  - DagBuilder, Dag, DagRunner
  - Build a DAG that runs in real time
    - /dataflow_amp/system/realtime_etl_data_observer/scripts/run_realtime_etl_data_observer.py
    - TODO(Grisha): consider converting into a Jupyter notebook
  - Build a `System` that runs in real time
    - /dataflow_amp/system/realtime_etl_data_observer/scripts/DataObserver_template.run_data_observer_simulation.py
    - TODO(Grisha): consider converting into a Jupyter notebook
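A minimal sketch of a real-time ETL loop in the spirit of the data-observer scripts above: wake up once per bar, pull the newest data, transform it, and persist the result. The data source and transform are stand-ins, not the actual KaizenFlow components.

```python
import datetime
import time

import numpy as np
import pandas as pd


def get_latest_bar() -> pd.DataFrame:
    """Stand-in for a real-time market data query."""
    now = pd.Timestamp.now(tz="UTC").floor("1min")
    return pd.DataFrame({"close": [100 + np.random.randn()]}, index=[now])


def transform(df: pd.DataFrame) -> pd.DataFrame:
    df["log_close"] = np.log(df["close"])
    return df


bar_seconds = 60
for _ in range(3):  # run forever in production; 3 iterations for the sketch
    bar = transform(get_latest_bar())
    bar.to_parquet(f"obs_{bar.index[-1]:%Y%m%d_%H%M%S}.parquet")
    # Sleep until the next bar boundary.
    now = datetime.datetime.now(datetime.timezone.utc)
    time.sleep(bar_seconds - now.second % bar_seconds)
```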
- Batch simulation of a Mock2 `System`
  - Description of the forecast system, description of the System, run a backtest, explanation of the backtesting script, analyze the results
  - /docs/kaizenflow/all.run_Mock2_in_batch_mode.how_to_guide.md
  - Build the config, load tiled results, compute portfolio bar metrics, compute aggregate portfolio stats
  - /docs/kaizenflow/all.analyze_Mock2_pipeline_simulation.how_to_guide.ipynb
- Run an end-to-end timed simulation of a Mock2 `System`
- TODO(gp): reorg the following files
  - /oms/notebooks/Master_PnL_real_time_observer.ipynb
  - /oms/notebooks/Master_bid_ask_execution_analysis.ipynb
  - /oms/notebooks/Master_broker_debugging.ipynb
  - /oms/notebooks/Master_broker_portfolio_reconciliation.ipynb
  - /oms/notebooks/Master_c1b_portfolio_vs_portfolio_reconciliation.ipynb
  - /oms/notebooks/Master_dagger_reconciliation.ipynb
  - /oms/notebooks/Master_execution_analysis.ipynb
  - /oms/notebooks/Master_model_qualifier.ipynb
  - /oms/notebooks/Master_multiday_system_reconciliation.ipynb
  - /oms/notebooks/Master_portfolio_vs_portfolio_reconciliation.ipynb
  - /oms/notebooks/Master_portfolio_vs_research_stats.ipynb
  - /oms/notebooks/Master_system_reconciliation_fast.ipynb
  - /oms/notebooks/Master_system_reconciliation_slow.ipynb
  - /oms/notebooks/Master_system_run_debugger.ipynb
- Learn how to create a `DataPull` adapter for a new data source
- How to update CCXT version
- Download `DataPull` historical data
  - ?
- Onboard new exchange
- Put a `DataPull` source in production with Airflow
  - /docs/datapull/ck.create_airflow_dag.tutorial.md
    - TODO(gp): This file is missing
  - /docs/datapull/ck.develop_an_airflow_dag_for_production.explanation.md
    - TODO(Juraj): See https://github.com/cryptokaizen/cmamp/issues/6444
- Add QA for a `DataPull` source
- Compare OHLCV bars
  - /im_v2/ccxt/data/client/notebooks/CmTask6537_One_off_comparison_of_Parquet_and_DB_OHLCV_data.ipynb
  - TODO(Grisha): review and generalize
- How to import Bloomberg historical data
- How to import Bloomberg real-time data
  - TODO(*): add doc
- TODO(gp): Add docs
  - /docs/datapull/ck.binance_trades_data_pipeline.explanation.md
  - /docs/datapull/ck.database_schema_update.how_to_guide.md
  - /docs/datapull/ck.datapull.explanation.md
  - /docs/datapull/ck.relational_database.explanation.md
- All software components
- Binance trading terms
- OMS explanation
- CCXT log structure
- Replayed CCXT exchange explanation
- How to generate broker test data
- Trading procedures (e.g., trading account information)
- How to run broker only/full system experiments
- Execution notebooks explanation
- Encrypt a model
- Model deployment in production
- Run production system
- Model references
- Monitor system
- System reconciliation explanation
- System reconciliation how-to guide
This documentation outlines the architecture and deployment processes for the Kaizen infrastructure, which combines AWS services, Kubernetes for container orchestration, and traditional EC2 instances for virtualized computing. The project follows an Infrastructure-as-Code (IaC) approach, using Terraform for provisioning and Ansible for configuration, to keep the environment maintainable and reproducible.
- Development and deployment stages
- S3 buckets overview
  - /docs/infra/ck.s3_buckets.explanation.md
  - This document provides an overview of the S3 buckets used by Kaizen Technologies
- Detailed steps for setting up the Kaizen infrastructure
- EC2 servers overview
- Implementation of autoscaling in the Kubernetes setup, focusing on the Cluster Autoscaler (CA), the Horizontal Pod Autoscaler (HPA), and Auto Scaling Groups (ASG)
- Comparison of AWS RDS instance types and storage performance
- Set up S3 buckets with Terraform
- AWS API key rotation guide
- Amazon Elastic File System (EFS) overview
- Client VPN endpoint creation with Terraform
- Set up AWS Client VPN
- Utility server application set-up overview
- Storing secret information (API keys, login credentials, access tokens, etc.)