Experiment Architecture

<script>mermaid.initialize({startOnLoad:true});</script>

This document is aimed at people wishing to contribute to Triage development. It explains the design and architecture of the Experiment class.

Dependency Graphs

For a general overview of how the parts of an experiment depend on each other, refer to the graphs below.

Experiment (high-level)

graph TD
    TC[Timechop]
    subgraph Architect
        LG[Label Generator]
        EDG[Entity-Date Generator]
        FG["Feature Generator (+ feature groups)"]
        MB[Matrix Builder]
    end
    subgraph Catwalk, per-model
        MT[Model Trainer]
        PR[Predictor]
        PG[Protected Group Generator]
        EV[Model Evaluator]
    end
    TC --> LG
    TC --> EDG
    TC --> FG
    LG --> MB
    EDG --> MB
    FG --> MB
    MB --> MT
    MB --> PR
    MT --> PR
    EDG --> PG
    PG --> EV
    PR --> EV

The Feature Generator node above hides some detail to keep the overall graph concise. To support feature grouping, more operations happen between feature table creation and matrix building. The relevant section of the dependency graph is expanded below, with each arrow labeled by the output that one component passes to the next.

Feature Dependency Details

graph TD
    TC[Timechop]
    FG[Feature Generator]
    FDG[Feature Dictionary Creator]
    FGC[Feature Group Creator]
    FGM[Feature Group Mixer]
    PL[Planner]
    MB[Matrix Builder]
    TC -- as-of-dates --> FG
    FG -- feature tables --> FDG
    FDG -- master feature dictionary --> FGC
    FGC -- feature groups --> FGM
    FGM -- recombined feature groups --> PL
    TC -- time splits --> PL
    FG -- feature tables --> MB
    PL -- matrix build tasks --> MB

Component List and Input/Output

These components are where the interesting data science work is done.

Timechop

Timechop does the necessary temporal math to set up temporal cross-validation. It 'chops' time according to config into train-test split definitions, which other components use.

Input

  • temporal_config in experiment config

Output

  • Time splits containing temporal cross-validation definition, including each as_of_date to be included in the matrices in each time split
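
As a rough illustration of the kind of output described above, the sketch below chops a date range into train/test split definitions, each carrying its own list of as-of-dates. It is not Timechop's actual algorithm or interface: the function name `chop_time`, its parameters, and the dictionary keys are all hypothetical, and a real temporal_config carries more options than this.

```python
from datetime import date, timedelta


def _dates(start, end, step_days):
    """All dates from start (inclusive) to end (exclusive), step_days apart."""
    out = []
    current = start
    while current < end:
        out.append(current)
        current += timedelta(days=step_days)
    return out


def chop_time(first_date, last_date, test_span_days, as_of_frequency_days):
    """Hypothetical sketch: walk backward from last_date, cutting one
    train/test split per test span. Not Timechop's real interface."""
    splits = []
    test_end = last_date
    while test_end - timedelta(days=test_span_days) > first_date:
        split_time = test_end - timedelta(days=test_span_days)
        splits.append({
            # Training only sees dates strictly before the split time.
            "train_as_of_dates": _dates(first_date, split_time, as_of_frequency_days),
            "test_as_of_dates": _dates(split_time, test_end, as_of_frequency_days),
        })
        test_end = split_time
    return splits


splits = chop_time(
    date(2020, 1, 1), date(2020, 4, 1),
    test_span_days=30, as_of_frequency_days=10,
)
```

The property that matters for temporal cross-validation is that every training as-of-date in a split precedes every test as-of-date in the same split.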

Entity-Date Table Generator

The EntityDateTableGenerator manages entity-date tables (including cohort and subset tables) by running the configured query for a number of different as_of_dates. Alternatively, if no query is configured, it retrieves all unique entities and dates from the labels table.

Input

  • All unique as_of_dates needed by matrices in the experiment, as provided by Timechop
  • query and name from cohort_config or subsets in the scoring section in an experiment config
  • entity-date table name that the caller wants to use

Output

  • An entity-date table in the database, consisting of entity ids and dates

Label Generator

The LabelGenerator manages a labels table by running the configured label query for a number of different as_of_dates and label_timespans.

Input

  • All unique as_of_dates and label_timespans needed by matrices in the experiment, as provided by Timechop
  • query and name from label_config in experiment config

Output

  • A labels table in the database, consisting of entity ids, dates, and boolean labels

Feature Generator

The FeatureGenerator manages a number of feature tables by converting the configured feature_aggregations into collate.Spacetime objects, and then running the queries generated by collate. For each feature_aggregation, it runs a few passes:

  1. Optionally, convert a complex from object (e.g. the FROM part of the configured aggregation query) into an indexed table for speed.
  2. Create a number of empty tables at different GROUP BY levels (always entity_id in triage) and run inserts individually for each as_of_date. These inserts are split up into individual tasks and parallelized for speed.
  3. Roll up the GROUP BY tables from step 2 to the entity_id level with a single LEFT JOIN query.
  4. Use the cohort table to find all members of the cohort not present in the table from step 3 and create a new table with all members of the cohort, null values filled in with values based on the rules in the feature_aggregations config.
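
The last pass above (filling in cohort members and null values) is done in SQL, but the idea can be sketched in plain Python. This is only an illustration under simplifying assumptions: the function name `fill_cohort`, the dictionary shapes, and the constant-only fill values are hypothetical, and the rules in feature_aggregations may be richer than a single constant per feature.

```python
def fill_cohort(cohort, feature_rows, fill_values):
    """Return one row per cohort member, imputing missing or null features.

    cohort: iterable of (entity_id, as_of_date) pairs
    feature_rows: dict mapping (entity_id, as_of_date) -> {feature: value or None}
    fill_values: dict mapping feature -> constant used when data is missing
    """
    filled = {}
    for key in cohort:
        # Cohort members with no aggregated row at all get an empty dict.
        row = feature_rows.get(key, {})
        filled[key] = {
            feature: row[feature] if row.get(feature) is not None else default
            for feature, default in fill_values.items()
        }
    return filled


cohort = [(1, "2020-01-01"), (2, "2020-01-01")]
feature_rows = {(1, "2020-01-01"): {"events_count": 3, "events_avg": None}}
fill_values = {"events_count": 0, "events_avg": -1.0}
filled = fill_cohort(cohort, feature_rows, fill_values)
```

Entity 1 keeps its observed count and has its null average imputed, while entity 2, absent from the aggregated table, receives the fill value for every feature.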

Input

  • All unique as_of_dates needed by matrices in the experiment, and the start time for features, as provided by Timechop
  • The populated cohort table, as provided by Entity-Date Table Generator
  • feature_aggregations in experiment config

Output

  • Populated feature tables in the database, one for each feature_aggregation

Feature Dictionary Creator

Summarizes the feature tables created by FeatureGenerator into a dictionary more easily usable for feature grouping and serialization purposes. Does this by querying the database's information_schema.

Input

  • Names of feature tables and the index of each table, as provided by Feature Generator

Output

  • A master feature dictionary, consisting of each populated feature table and all of its feature column names.

Feature Group Creator

Creates feature groups by taking the configured feature grouping rules and applying them to the master feature dictionary, to create a collection of smaller feature dictionaries.

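Input

  • Master feature dictionary, as provided by Feature Dictionary Creator
  • feature_group_definition in experiment config
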
Output

  • List of feature dictionaries, each representing one feature group
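
A minimal sketch of the grouping step, assuming a hypothetical rule that groups feature columns by name prefix. The function name `create_feature_groups` and the table and column names are made up for illustration; the real grouping rules are whatever the experiment config defines.

```python
def create_feature_groups(master_dict, prefixes):
    """Split a master feature dictionary into one smaller dictionary per prefix.

    master_dict: dict mapping feature table name -> list of feature column names
    prefixes: hypothetical grouping rule, one group per column-name prefix
    """
    groups = []
    for prefix in prefixes:
        group = {}
        for table, columns in master_dict.items():
            matching = [c for c in columns if c.startswith(prefix)]
            if matching:
                group[table] = matching
        groups.append(group)
    return groups


master = {
    "features_a": ["arrests_count", "arrests_avg"],
    "features_b": ["calls_count"],
}
groups = create_feature_groups(master, ["arrests", "calls"])
```

Each resulting group has the same shape as the master dictionary, which is what lets later components treat a group and the full feature set interchangeably.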

Feature Group Mixer

Combines feature groups into new ones based on the configured rules (e.g. leave-one-out, leave-one-in).

Input

  • List of feature dictionaries, as provided by Feature Group Creator
  • feature_group_strategies in experiment config

Output

  • List of feature dictionaries, each representing one or more feature groups.
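
The two strategies named above can be sketched as follows. The function names and the `merge` helper are hypothetical and this is not the Feature Group Mixer's actual code; it only shows how a list of feature dictionaries might be recombined.

```python
def merge(groups):
    """Combine several feature dictionaries into one, table by table."""
    merged = {}
    for group in groups:
        for table, columns in group.items():
            merged.setdefault(table, []).extend(columns)
    return merged


def leave_one_in(groups):
    """One combination per group, containing only that group."""
    return [merge([group]) for group in groups]


def leave_one_out(groups):
    """One combination per group, containing every group except that one."""
    return [merge(groups[:i] + groups[i + 1:]) for i in range(len(groups))]


groups = [{"t_a": ["a1"]}, {"t_b": ["b1"]}, {"t_c": ["c1"]}]
loo = leave_one_out(groups)
loi = leave_one_in(groups)
```

With three input groups, each strategy yields three feature dictionaries, so the Planner downstream sees six candidate feature sets if both strategies are configured.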

Planner

Mixes time split definitions and feature groups to create the master list of matrices that are required for modeling to proceed.

Input

  • List of feature dictionaries, as provided by Feature Group Mixer
  • List of matrix split definitions, as provided by Timechop
  • user_metadata, in experiment config
  • feature_start_time from temporal_config in experiment config
  • cohort name from cohort_config in experiment config
  • label name from label_config in experiment config

Output

  • List of serializable matrix build tasks, consisting of everything needed to build a single matrix:
    • list of as-of-dates
    • a label name
    • a label type
    • a feature dictionary
    • matrix uuid
    • matrix metadata
    • matrix type (train or test)
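
To make the idea of "mixing" concrete, the sketch below crosses time splits with feature dictionaries and derives a deterministic identifier from each task's metadata. It is an assumption-laden simplification: the function name `plan_matrices`, the metadata keys, and the use of a SHA-256 digest as the matrix identifier are all hypothetical, and real build tasks carry more fields than shown here.

```python
import hashlib
import itertools
import json


def plan_matrices(time_splits, feature_dicts, label_name):
    """Return a dict of matrix id -> serializable build task (hypothetical shape)."""
    tasks = {}
    for split, features in itertools.product(time_splits, feature_dicts):
        for matrix_type in ("train", "test"):
            metadata = {
                "as_of_dates": split[matrix_type],
                "label_name": label_name,
                "feature_dictionary": features,
                "matrix_type": matrix_type,
            }
            # Deterministic id: identical metadata always maps to the same matrix,
            # so an already-built matrix can be recognized and reused.
            serialized = json.dumps(metadata, sort_keys=True)
            matrix_id = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
            tasks[matrix_id] = metadata
    return tasks


time_splits = [
    {"train": ["2020-01-01"], "test": ["2020-02-01"]},
    {"train": ["2020-01-01", "2020-02-01"], "test": ["2020-03-01"]},
]
feature_dicts = [{"t_a": ["a1"]}, {"t_b": ["b1"]}]
tasks = plan_matrices(time_splits, feature_dicts, "outcome")
```

Because every task is plain data, it can be serialized and handed to another process or machine for building.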

Matrix Builder

Takes matrix build tasks from the Planner and builds them if they don't already exist.

Input

  • A matrix build task, as provided by Planner
  • include_missing_labels_in_train_as from label_config in experiment config
  • The experiment's MatrixStorageEngine

Output

  • The built matrix saved in the MatrixStorageEngine
  • A row describing the matrix saved in the database's triage_metadata.matrices table.

ModelTrainTester

A meta-component of sorts. Encompasses all of the other catwalk components.

Input

Output

  • All of its components are run, resulting in trained models, predictions, evaluation metrics, and individual importances

ModelGrouper

Assigns a model group to each model based on its metadata.

Input

  • model_group_keys in experiment config
  • All the data about a particular model needed to decide a model group for the model: classifier name, hyperparameter list, and matrix metadata, as provided by ModelTrainer

Output

  • a model group id corresponding to a row in the triage_metadata.model_groups table, either a matching one that already existed in the table or one that it autoprovisioned.
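
The get-or-create behavior described above can be sketched with an in-memory dictionary standing in for the model_groups table. The class name `InMemoryModelGrouper`, its method, and the metadata keys are hypothetical; the real component works against the database.

```python
import json


class InMemoryModelGrouper:
    """Hypothetical stand-in: a dict plays the role of the model_groups table."""

    def __init__(self, model_group_keys):
        self.model_group_keys = sorted(model_group_keys)
        self._groups = {}

    def get_model_group_id(self, class_path, hyperparameters, matrix_metadata):
        # Only the configured keys of the matrix metadata take part in grouping.
        signature = json.dumps(
            {
                "class_path": class_path,
                "hyperparameters": hyperparameters,
                "matrix": {k: matrix_metadata.get(k) for k in self.model_group_keys},
            },
            sort_keys=True,
        )
        if signature not in self._groups:
            # No matching group yet: provision a new id.
            self._groups[signature] = len(self._groups) + 1
        return self._groups[signature]


grouper = InMemoryModelGrouper(["label_name"])
first = grouper.get_model_group_id(
    "sklearn.tree.DecisionTreeClassifier", {"max_depth": 3},
    {"label_name": "outcome", "train_end_time": "2020-01-01"},
)
second = grouper.get_model_group_id(
    "sklearn.tree.DecisionTreeClassifier", {"max_depth": 3},
    {"label_name": "outcome", "train_end_time": "2020-02-01"},
)
third = grouper.get_model_group_id(
    "sklearn.tree.DecisionTreeClassifier", {"max_depth": 5},
    {"label_name": "outcome", "train_end_time": "2020-01-01"},
)
```

Here the first two models differ only in a metadata field that is not a grouping key, so they share a group; the third has different hyperparameters and gets a new one.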

ModelTrainer

Trains a model, stores it, and saves its metadata (including model group information and feature importances) to the database. Each model to be trained is expressed as a serializable task so that it can be parallelized.

Input

Output

  • a row in the database's triage_metadata.model_groups table, the triage_metadata.models table, and rows in train_results.feature_importances for each feature.
  • the trained model persisted in the ModelStorageEngine

Predictor

Generates predictions for a given model and matrix, both returning them for immediate use and saving them to the database.

Input

Output

  • The predictions as an array
  • Each prediction saved to the database, unless configured not to. The table they are stored in depends on which type of matrix it is (e.g. test_results.predictions or train_results.predictions)

Protected Group Table Generator

Generates a table containing protected group attributes (e.g. race, sex, age).

Input

  • A cohort table name and its configuration's unique hash
  • Bias audit configuration, specifically a from object (either a table or query), and column names in the from object for protected attributes, knowledge date, and entity id.
  • A name for the protected groups table

Output

  • A protected groups table, containing all rows from the cohort and any protected group information present in the from object, as well as the cohort hash so multiple cohorts can live in the same table.

ModelEvaluator

Generates evaluation metrics for a given model and matrix over the entire matrix and for any subsets.

Input

  • scoring in experiment config
  • array of predictions
  • the MatrixStore and model_id that the predictions were generated from
  • the subset to be evaluated (or None for the whole matrix)
  • the reference group and thresholding rules from bias_audit_config in experiment config
  • the protected group generator object (for retrieving protected group data)

Output

  • A row in the database for each evaluation metric for each subset. The table they are stored in depends on which type of matrix it is (e.g. test_results.evaluations or train_results.evaluations).
  • A row in the database for each Aequitas bias report. Either test_results.aequitas or train_results.aequitas.

Individual Importance Calculator

Generates the top n feature importances for each entity in a given model.

Input

  • individual_importance_config in experiment config.
  • model id
  • a MatrixStore object for a test matrix
  • an as-of-date

Output

  • rows in the test_results.individual_importances table for the model, date, and matrix based on the configured method and number of top features per entity.

General Class Design

The Experiment class is designed to have all work done by component objects that reside as attributes on the instance. The purpose of this is to maximize the reuse potential of the components outside of the Experiment, as well as avoid excessive class inheritance within the Experiment.

The inheritance tree of the Experiment is reserved for execution concerns, such as switching between singlethreaded, multiprocess, or cluster execution. To enable these different execution contexts without excessive duplicated code, the components that cover computationally or memory-intensive work generally implement methods to generate a collection of serializable tasks to perform later, on either that same object or perhaps another one running in another process or machine. The subclasses of Experiment then differentiate themselves by implementing methods to execute a collection of these tasks using their preferred method of execution, whether it be a simple loop, a process pool, or a cluster.
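
A toy version of this split between components and execution might look like the sketch below. Every name in it is hypothetical (`SquareComponent`, `ExperimentSketch`, and so on), and a thread pool is used only to keep the example small; it is meant to show the shape of the pattern, not the Experiment's real methods.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class SquareComponent:
    """Hypothetical component: plans serializable tasks, then runs one at a time."""

    def generate_tasks(self, numbers):
        return [{"n": n} for n in numbers]

    def process_task(self, n):
        return n * n


class ExperimentSketch(ABC):
    """Base class owns the workflow; subclasses only decide how tasks execute."""

    def __init__(self):
        self.component = SquareComponent()

    def run(self, numbers):
        tasks = self.component.generate_tasks(numbers)
        return self.process_tasks(tasks)

    @abstractmethod
    def process_tasks(self, tasks):
        ...


class SerialSketch(ExperimentSketch):
    def process_tasks(self, tasks):
        # A simple loop.
        return [self.component.process_task(**task) for task in tasks]


class PooledSketch(ExperimentSketch):
    def process_tasks(self, tasks):
        # A worker pool; map() returns results in task order.
        with ThreadPoolExecutor(max_workers=2) as pool:
            return list(pool.map(lambda t: self.component.process_task(**t), tasks))


serial_result = SerialSketch().run([1, 2, 3])
pooled_result = PooledSketch().run([1, 2, 3])
```

Both subclasses produce the same results because the work itself lives in the component; only the execution strategy differs.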

The components are created and experiment configuration is bound to them at Experiment construction time, so that the instance methods can have concise call signatures that only cover the information passed by other components mid-experiment.

Data reuse/replacement is handled within components. The Experiment generally just hands the replace flag to each component at object construction, and at runtime each component uses that and determines whether or not the needed work has already been done.
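
As a minimal sketch of how a component might honor the replace flag, assume a dictionary standing in for whatever storage the component writes to. The class `CachedBuilder` and its attributes are hypothetical.

```python
class CachedBuilder:
    """Hypothetical component that receives the replace flag at construction."""

    def __init__(self, storage, replace=False):
        self.storage = storage  # dict standing in for a database or file store
        self.replace = replace
        self.builds = 0

    def build(self, key):
        if key in self.storage and not self.replace:
            # The needed work has already been done: reuse it.
            return self.storage[key]
        self.builds += 1
        self.storage[key] = f"built-{key}"
        return self.storage[key]


storage = {}
reuser = CachedBuilder(storage)
reuser.build("matrix-1")
reuser.build("matrix-1")  # second call finds the earlier result

replacer = CachedBuilder(storage, replace=True)
replacer.build("matrix-1")  # rebuilds even though a result exists
```

The caller never checks for existing work itself; it passes the flag once and lets the component decide at runtime.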

I'm trying to find some behavior. Where does it reside?

If you're looking to change behavior of the Experiment,

  • Where possible, the logic resides in one of the components. The component list above should help you find the boundaries between components.
  • Logic that specifically relates to parallel execution is in one of the Experiment subclasses (see the parallelization section below).
  • Everything else is in the Experiment base class. This is where the public interface (.run()) resides. It follows a template method pattern to define the skeleton of the Experiment: instantiating components based on experiment configuration and runtime inputs, and passing output from one component to another.

I want to add a new option. Where should I put it?

Generally, the experiment configuration is where any new options go that change any data science-related functionality; in other words, if you could conceivably get better precision from the change, it should make it into experiment configuration. This is so the hashed experiment config is meaningful and the experiment can be audited by looking at the experiment configuration rather than requiring the perusal of custom code. The blind spot in this is, of course, the state of the database, which can always change results, but it's useful for database state to continue to be the only exception to this rule.

On the other hand, new options that affect only runtime concerns (e.g. performance boosts) should go as arguments to the Experiment. For instance, changing the number of cores to use for matrix building or telling it to skip predictions won't change the answer you're looking for; options like these just help you potentially get to the answer faster. Once an experiment is completed, runtime flags like these should be totally safe to ignore in analysis.
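
The distinction matters because of how a config hash behaves. The sketch below assumes a hash computed over the JSON-serialized configuration; the function name `experiment_hash`, the SHA-256 choice, and the example config keys and values are illustrative assumptions, not a description of how Triage actually computes its hash.

```python
import hashlib
import json


def experiment_hash(config):
    """Hash a config dict so that equal configs always give equal hashes."""
    serialized = json.dumps(config, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Data-science options live in the config and therefore change the hash.
config = {"label_config": {"name": "outcome"}, "model_group_keys": ["label_name"]}
changed = {"label_config": {"name": "other"}, "model_group_keys": ["label_name"]}

# Runtime options (here a hypothetical worker count) are passed separately,
# so two runs with different values still share one experiment hash.
run_a = {"config_hash": experiment_hash(config), "n_processes": 1}
run_b = {"config_hash": experiment_hash(config), "n_processes": 8}
```

If a runtime flag were placed in the config instead, two runs that produce the same answer would look like different experiments.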

Storage Abstractions

Another important part of enabling different execution contexts is being able to pass large, persisted objects (e.g. matrices or models) by reference to another process or cluster. To achieve this, as well as provide the ability to configure different storage mediums (e.g. S3) and formats (e.g. HDF) without changes to the Experiment class, all references to these large objects within any components are handled through an abstraction layer.

Matrix Storage

All interactions with individual matrices and their bundled metadata are handled through MatrixStore objects. The storage medium is handled through a base Store object that is an attribute of the MatrixStore. The storage format is handled through inheritance on the MatrixStore: Each subclass, such as CSVMatrixStore or HDFMatrixStore, implements the necessary methods (save, load, head_of_matrix) to properly persist or load a matrix from its storage.

In addition, the MatrixStore provides a variety of methods to retrieve data from either the base matrix itself or its metadata. For instance (this is not meant to be a complete list):

  • matrix - the raw matrix
  • metadata - the raw metadata dictionary
  • exists - whether or not it exists in storage
  • columns - the column list
  • labels - the label column
  • uuid - the matrix's UUID
  • as_of_dates - the matrix's list of as-of-dates

One MatrixStorageEngine exists at the Experiment level, and roughly corresponds with a directory wherever matrices are stored. Its only interface is to provide a MatrixStore object given a matrix UUID.
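
The layering described in this section (medium as an attribute, format by inheritance, engine as a factory keyed on UUID) can be sketched as below. All class names ending in `Sketch`, the `MemoryStore` class, and the method signatures are hypothetical simplifications, not the real classes.

```python
import csv
import io


class MemoryStore:
    """Storage medium: a dict of bytes stands in for a filesystem or S3."""

    def __init__(self):
        self._blobs = {}

    def exists(self, path):
        return path in self._blobs

    def write(self, path, data):
        self._blobs[path] = data

    def load(self, path):
        return self._blobs[path]


class MatrixStoreSketch:
    """Medium-agnostic base; subclasses supply the storage format."""

    suffix = None

    def __init__(self, store, matrix_uuid):
        self.store = store
        self.uuid = matrix_uuid

    @property
    def path(self):
        return f"{self.uuid}.{self.suffix}"

    @property
    def exists(self):
        return self.store.exists(self.path)


class CSVMatrixStoreSketch(MatrixStoreSketch):
    suffix = "csv"

    def save(self, rows):
        buffer = io.StringIO()
        csv.writer(buffer).writerows(rows)
        self.store.write(self.path, buffer.getvalue().encode("utf-8"))

    def load(self):
        text = self.store.load(self.path).decode("utf-8")
        return list(csv.reader(io.StringIO(text)))


class MatrixStorageEngineSketch:
    """Only job: hand out a matrix store for a given matrix UUID."""

    def __init__(self, store, matrix_store_class=CSVMatrixStoreSketch):
        self.store = store
        self.matrix_store_class = matrix_store_class

    def get_store(self, matrix_uuid):
        return self.matrix_store_class(self.store, matrix_uuid)


engine = MatrixStorageEngineSketch(MemoryStore())
matrix_store = engine.get_store("abc123")
existed_before = matrix_store.exists
matrix_store.save([["entity_id", "label"], ["1", "0"]])
```

A component that holds only the UUID can ask the engine for the store again, in the same process or another one, and read the matrix back without knowing where or in what format it was saved.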

Model Storage

Model storage is handled similarly to matrix storage, although the interactions with it are far simpler so there is no single-model class akin to the MatrixStore. One ModelStorageEngine exists at the Experiment level, configured with the Experiment's storage medium, and through it trained models can be saved or loaded. The ModelStorageEngine uses joblib to save and load compressed pickles of the model.

Miscellaneous Project Storage

Both the ModelStorageEngine and MatrixStorageEngine are based on a more general storage abstraction that is suitable for any other auxiliary objects (e.g. graph images) that need to be stored. That is the ProjectStorage object, which roughly corresponds to a directory on some storage medium where we store everything. One of these exists as an Experiment attribute, and its interface .get_store can be used to persist or load whatever is needed.

Parallelization/Subclassing Details

In the Class Design section above, we introduced tasks for parallelization and subclassing for execution changes. In this section, we expand on both to provide a guide to working with them.

Currently there are three methods that must be implemented by subclasses of Experiment in order to be fully functional.

Abstract Methods

  • process_query_tasks - Run feature generation queries. Receives a list of tasks. Each task actually represents a table and is split into three lists of queries to enable the implementation to avoid deadlocks: prepare (table creation), inserts (a collection of INSERT INTO SELECT queries), and finalize (indexing). prepare needs to be run before the inserts and finalize is best run after the inserts, so it is advised that only the inserts are parallelized. The subclass should run each individual batch of queries by calling self.feature_generator.run_commands([list of queries]), which will run all of the queries serially, so the implementation can send a batch of queries to each worker instead of having each individual query be on a new worker.
  • process_matrix_build_tasks - Run matrix build tasks (that assume all the necessary label/cohort/feature tables have been built). Receives a dictionary of tasks. Each key is a matrix UUID, and each value is a dictionary that has all the necessary keyword arguments to call self.matrix_builder.build_matrix to build one matrix.
  • process_train_test_batches - Run model train/test task batches (that assume all matrices are built). Receives a list of triage.component.catwalk.TaskBatch objects, each of which has a list of tasks, a description of those tasks, and whether or not that batch is safe to run in parallel. Within this, each task is a dictionary that has all the necessary keyword arguments to call self.model_train_tester.process_task to train and test one model. Each task covers model training, prediction (on both test and train matrices), model evaluation (on both test and train matrices), and saving of global and individual feature importances.
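
The ordering advice for process_query_tasks can be sketched as a standalone function. This is an illustration under assumptions: each task is taken to be a dictionary with "prepare", "inserts", and "finalize" keys, the `run_commands` argument stands in for self.feature_generator.run_commands, and the batching parameters and thread pool are arbitrary choices for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_query_tasks(query_tasks, run_commands, n_workers=2, batch_size=2):
    """Sketch: prepare serially, parallelize insert batches, then finalize."""
    for task in query_tasks:
        # Table creation must finish before any insert starts.
        run_commands(task["prepare"])
        inserts = task["inserts"]
        batches = [inserts[i:i + batch_size] for i in range(0, len(inserts), batch_size)]
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            # Each worker receives a whole batch, which run_commands executes serially.
            list(pool.map(run_commands, batches))
        # Indexing runs only after every insert batch has completed.
        run_commands(task["finalize"])


executed = []


def fake_run_commands(queries):
    """Stand-in for the real query runner: records queries instead of running them."""
    executed.extend(queries)


process_query_tasks(
    [{
        "prepare": ["CREATE TABLE t"],
        "inserts": ["INSERT 1", "INSERT 2", "INSERT 3"],
        "finalize": ["CREATE INDEX"],
    }],
    fake_run_commands,
)
```

Only the middle phase runs concurrently, so the order of inserts may vary between runs while prepare always comes first and finalize always comes last.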

Reference Implementations

  • SingleThreadedExperiment is a barebones implementation that runs everything serially.
  • MultiCoreExperiment utilizes local multiprocessing to run tasks through a worker pool. Reading this is helpful to see the minimal implementation needed for some parallelization.
  • RQExperiment utilizes an RQ worker cluster to allow the tasks to be parallelized either locally or distributed to other machines. It does not take care of spawning a cluster or any other infrastructural concerns: it expects that the cluster is running somewhere and is reading from the same Redis instance that is passed to the RQExperiment. The RQExperiment simply enqueues tasks and waits for them to be completed. Reading this is helpful as a simple example of how to enable distributed computing.