This document is aimed at people wishing to contribute to Triage development. It explains the design and architecture of the Experiment class.
For a general overview of how the parts of an experiment depend on each other, refer to the graphs below.
The FeatureGenerator section above hides some details to make the overall graph flow more concise. To support feature grouping, more operations happen between feature table creation and matrix building. The relevant section of the dependency graph is expanded below, with each arrow labeled with the output that one component passes to the next.
These are where the interesting data science work is done.
- Timechop (temporal cross-validation)
- Architect (design matrix creation)
- Catwalk (modeling)
Timechop does the necessary temporal math to set up temporal cross-validation. It 'chops' time according to config into train-test split definitions, which other components use.
Input
- temporal_config in experiment config
Output
- Time splits containing the temporal cross-validation definition, including each as_of_date to be included in the matrices in each time split
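As a rough illustration of what "chopping" time means, the sketch below turns a few settings into train/test split definitions. It is not Timechop's algorithm or API: the function `chop_time`, its parameters, and the dictionary keys are hypothetical, and a real temporal_config carries many more options.

```python
from datetime import date, timedelta


def _dates_between(first, stop, step_days):
    """All dates from `first` (inclusive) up to `stop` (exclusive), `step_days` apart."""
    dates = []
    current = first
    while current < stop:
        dates.append(current)
        current += timedelta(days=step_days)
    return dates


def chop_time(start, end, test_span_days, as_of_frequency_days):
    """Hypothetical helper: walk backwards from `end`, cutting one split per test span."""
    splits = []
    test_end = end
    while test_end - timedelta(days=test_span_days) > start:
        split_date = test_end - timedelta(days=test_span_days)
        splits.append({
            # Training only ever sees dates before the split date.
            "train_as_of_dates": _dates_between(start, split_date, as_of_frequency_days),
            "test_as_of_dates": _dates_between(split_date, test_end, as_of_frequency_days),
        })
        test_end = split_date
    return splits


splits = chop_time(date(2020, 1, 1), date(2020, 4, 1), test_span_days=30, as_of_frequency_days=10)
```

The property worth noticing is that within every split, each train as_of_date precedes each test as_of_date, which is what makes the cross-validation temporal.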
The EntityDateTableGenerator manages entity-date tables (including cohort and subset tables) by running the configured query for a number of different as_of_dates. Alternatively, if no query is configured, it retrieves all unique entities and dates from the labels table.
Input
- All unique as_of_dates needed by matrices in the experiment, as provided by Timechop
- query and name from cohort_config or subsets in the scoring section in an experiment config
- entity-date table name that the caller wants to use
Output
- An entity-date table in the database, consisting of entity ids and dates
The LabelGenerator manages a labels table by running the configured label query for a number of different as_of_dates and label_timespans.
Input
- All unique as_of_dates and label_timespans needed by matrices in the experiment, as provided by Timechop
- query and name from label_config in experiment config
Output
- A labels table in the database, consisting of entity ids, dates, and boolean labels
The FeatureGenerator manages a number of feature tables by converting the configured feature_aggregations into collate.Spacetime objects, and then running the queries generated by collate. For each feature_aggregation, it runs a few passes:
- Optionally, convert a complex from object (e.g. the FROM part of the configured aggregation query) into an indexed table for speed.
- Create a number of empty tables at different GROUP BY levels (always entity_id in triage) and run inserts individually for each as_of_date. These inserts are split up into individual tasks and parallelized for speed.
- Roll up the GROUP BY tables from the previous pass to the entity_id level with a single LEFT JOIN query.
- Use the cohort table to find all members of the cohort not present in the rolled-up table, and create a new table with all members of the cohort, with null values filled in based on the rules in the feature_aggregations config.
Input
- All unique as_of_dates needed by matrices in the experiment, and the start time for features, as provided by Timechop
- The populated cohort table, as provided by Entity-Date Table Generator
- feature_aggregations in experiment config
Output
- Populated feature tables in the database, one for each feature_aggregation
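The final pass (making sure every cohort member has a row, with missing values filled in) can be sketched as follows. This is only an illustration: an in-memory SQLite database stands in for the real one, the table and column names are hypothetical, the actual SQL is generated by collate, and a constant zero stands in for whatever imputation rule the config specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cohort (entity_id INTEGER, as_of_date TEXT);
    CREATE TABLE events_aggregation (entity_id INTEGER, as_of_date TEXT, events_count REAL);
    INSERT INTO cohort VALUES (1, '2020-01-01'), (2, '2020-01-01'), (3, '2020-01-01');
    INSERT INTO events_aggregation VALUES (1, '2020-01-01', 4.0), (3, '2020-01-01', 1.0);
""")

# Every cohort member gets a row; entities with no aggregated events
# come back NULL from the LEFT JOIN and are filled with a constant.
conn.execute("""
    CREATE TABLE events_aggregation_imputed AS
    SELECT c.entity_id,
           c.as_of_date,
           COALESCE(f.events_count, 0) AS events_count
    FROM cohort c
    LEFT JOIN events_aggregation f
      ON f.entity_id = c.entity_id AND f.as_of_date = c.as_of_date
""")

rows = conn.execute(
    "SELECT entity_id, events_count FROM events_aggregation_imputed ORDER BY entity_id"
).fetchall()
```

Entity 2 has no events in the aggregation table but still appears in the imputed table, which is what lets matrix building assume one feature row per cohort member.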
The Feature Dictionary Creator summarizes the feature tables created by FeatureGenerator into a dictionary more easily usable for feature grouping and serialization purposes. It does this by querying the database's information_schema.
Input
- Names of feature tables and the index of each table, as provided by Feature Generator
Output
- A master feature dictionary, consisting of each populated feature table and all of its feature column names.
The Feature Group Creator applies the configured feature grouping rules to the master feature dictionary to create a collection of smaller feature dictionaries, one per feature group.
Input
- Master feature dictionary, as provided by Feature Dictionary Creator
- feature_group_definition in experiment config
Output
- List of feature dictionaries, each representing one feature group
The Feature Group Mixer combines feature groups into new ones based on the configured rules (e.g. leave-one-out, leave-one-in).
Input
- List of feature dictionaries, as provided by Feature Group Creator
- feature_group_strategies in experiment config
Output
- List of feature dictionaries, each representing one or more feature groups.
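A minimal sketch of the two mixing strategies named above, assuming each feature group is a plain dictionary mapping a feature table name to its column names. The function names and the example tables are hypothetical, not the component's actual interface.

```python
def leave_one_out(feature_groups):
    """One combination per group, containing every group except that one."""
    return [
        feature_groups[:i] + feature_groups[i + 1:]
        for i in range(len(feature_groups))
    ]


def leave_one_in(feature_groups):
    """One combination per group, containing only that group."""
    return [[group] for group in feature_groups]


def merge(groups):
    """Collapse a combination of feature dictionaries into a single dictionary."""
    merged = {}
    for group in groups:
        for table, columns in group.items():
            merged.setdefault(table, []).extend(columns)
    return merged


# Hypothetical feature groups: each maps a feature table to its columns.
feature_groups = [
    {"arrests_aggregation_imputed": ["arrests_1y_count"]},
    {"bookings_aggregation_imputed": ["bookings_1y_count", "bookings_1y_sum"]},
    {"demographics_aggregation_imputed": ["age_max"]},
]

mixed = [merge(combo) for combo in leave_one_out(feature_groups)]
```

With three groups, leave-one-out yields three dictionaries of two groups each, and leave-one-in yields three dictionaries of one group each.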
The Planner mixes time split definitions and feature groups to create the master list of matrices that are required for modeling to proceed.
Input
- List of feature dictionaries, as provided by Feature Group Mixer
- List of matrix split definitions, as provided by Timechop
- user_metadata, in experiment config
- feature_start_time from temporal_config in experiment config
- cohort name from cohort_config in experiment config
- label name from label_config in experiment config
Output
- List of serializable matrix build tasks, consisting of everything needed to build a single matrix:
  - list of as-of-dates
  - a label name
  - a label type
  - a feature dictionary
  - matrix uuid
  - matrix metadata
  - matrix type (train or test)
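The mixing step can be pictured as a cross product of time splits and feature dictionaries, with one train and one test task per pair. The sketch below is an assumption-laden illustration: the function name, the metadata keys, and the choice of an MD5 hash over the metadata as the identifier are all hypothetical and may differ from how Triage derives matrix UUIDs.

```python
import hashlib
import itertools
import json


def plan_matrix_build_tasks(time_splits, feature_dictionaries, label_name):
    """Hypothetical planner: one task per (split, matrix type, feature dictionary)."""
    tasks = {}
    for split, features in itertools.product(time_splits, feature_dictionaries):
        for matrix_type in ("train", "test"):
            metadata = {
                "as_of_dates": split[f"{matrix_type}_as_of_dates"],
                "label_name": label_name,
                "feature_dictionary": features,
                "matrix_type": matrix_type,
            }
            # Hashing the sorted metadata gives a stable identifier, so planning
            # the same matrix twice yields the same key.
            matrix_uuid = hashlib.md5(
                json.dumps(metadata, sort_keys=True).encode("utf-8")
            ).hexdigest()
            tasks[matrix_uuid] = metadata
    return tasks


time_splits = [
    {"train_as_of_dates": ["2020-01-01"], "test_as_of_dates": ["2020-02-01"]},
    {"train_as_of_dates": ["2020-01-01", "2020-02-01"], "test_as_of_dates": ["2020-03-01"]},
]
feature_dictionaries = [
    {"arrests_aggregation_imputed": ["arrests_1y_count"]},
    {"bookings_aggregation_imputed": ["bookings_1y_count"]},
]
tasks = plan_matrix_build_tasks(time_splits, feature_dictionaries, "any_arrest")
```

Because every task is a plain dictionary keyed by a deterministic identifier, the tasks can be serialized, shipped to another process, and recognized as already built on a later run.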
The Matrix Builder takes matrix build tasks from the Planner and builds them if they don't already exist.
Input
- A matrix build task, as provided by Planner
- include_missing_labels_in_train_as from label_config in experiment config
- The experiment's MatrixStorageEngine
Output
- The built matrix saved in the MatrixStorageEngine
- A row describing the matrix saved in the database's triage_metadata.matrices table.
The ModelTrainTester is a meta-component of sorts: it encompasses all of the other Catwalk components.
Input
- One temporal split, as provided by Timechop
- grid_config in experiment config
- Fully configured ModelTrainer, Predictor, ModelEvaluator, Individual Importance Calculator objects
Output
- All of its components are run, resulting in trained models, predictions, evaluation metrics, and individual importances
The ModelGrouper assigns a model group to each model based on its metadata.
Input
- model_group_keys in experiment config
- All the data about a particular model needed to decide a model group for the model: classifier name, hyperparameter list, and matrix metadata, as provided by ModelTrainer
Output
- a model group id corresponding to a row in the triage_metadata.model_groups table, either a matching one that already existed in the table or one that it autoprovisioned.
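The "matching row or autoprovisioned row" behavior is a get-or-create lookup. A minimal sketch, assuming a dictionary stands in for the model_groups table; the class name, method name, and signature format are hypothetical.

```python
import json


class InMemoryModelGrouper:
    """Hypothetical stand-in: a dict plays the role of the model_groups table."""

    def __init__(self, model_group_keys):
        self.model_group_keys = model_group_keys
        self._group_ids = {}

    def get_model_group_id(self, class_path, parameters, matrix_metadata):
        # Only the configured keys of the matrix metadata take part in grouping.
        signature = json.dumps(
            {
                "class_path": class_path,
                "parameters": parameters,
                "matrix": {key: matrix_metadata[key] for key in self.model_group_keys},
            },
            sort_keys=True,
        )
        if signature not in self._group_ids:
            # No matching group yet: provision the next id.
            self._group_ids[signature] = len(self._group_ids) + 1
        return self._group_ids[signature]


grouper = InMemoryModelGrouper(model_group_keys=["label_name", "feature_groups"])
january = {"label_name": "any_arrest", "feature_groups": ["arrests"], "end_time": "2020-01-01"}
july = {"label_name": "any_arrest", "feature_groups": ["arrests"], "end_time": "2020-07-01"}

first = grouper.get_model_group_id("sklearn.tree.DecisionTreeClassifier", {"max_depth": 3}, january)
second = grouper.get_model_group_id("sklearn.tree.DecisionTreeClassifier", {"max_depth": 3}, july)
third = grouper.get_model_group_id("sklearn.tree.DecisionTreeClassifier", {"max_depth": 5}, july)
```

Here the first two models share a group because they differ only in a metadata key that is not among the grouping keys, while the third gets a new group because its hyperparameters differ.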
The ModelTrainer trains a model, stores it, and saves its metadata (including model group information and feature importances) to the database. Each model to be trained is expressed as a serializable task so that it can be parallelized.
Input
- an instance of the ModelGrouper class.
- the experiment's ModelStorageEngine
- a MatrixStore object
- an importable classifier path and a set of hyperparameters
Output
- a row in the database's triage_metadata.model_groups table, the triage_metadata.models table, and rows in train_results.feature_importances for each feature.
- the trained model persisted in the ModelStorageEngine
The Predictor generates predictions for a given model and matrix, both returning them for immediate use and saving them to the database.
Input
- The experiment's Model Storage Engine
- A model id corresponding to a row from the database
- A MatrixStore object
Output
- The predictions as an array
- Each prediction saved to the database, unless configured not to. The table they are stored in depends on which type of matrix it is (e.g. test_results.predictions or train_results.predictions)
The protected group generator creates a table containing protected group attributes (e.g. race, sex, age).
Input
- A cohort table name and its configuration's unique hash
- Bias audit configuration, specifically a from object (either a table or query), and column names in the from object for protected attributes, knowledge date, and entity id.
- A name for the protected groups table
Output
- A protected groups table, containing all rows from the cohort and any protected group information present in the from object, as well as the cohort hash so multiple cohorts can live in the same table.
The ModelEvaluator generates evaluation metrics for a given model and matrix over the entire matrix and for any subsets.
Input
- scoring in experiment config
- array of predictions
- the MatrixStore and model_id that the predictions were generated from
- the subset to be evaluated (or None for the whole matrix)
- the reference group and thresholding rules from bias_audit_config in experiment config
- the protected group generator object (for retrieving protected group data)
Output
- A row in the database for each evaluation metric for each subset. The table they are stored in depends on which type of matrix it is (e.g. test_results.evaluations or train_results.evaluations).
- A row in the database for each Aequitas bias report. Either test_results.aequitas or train_results.aequitas.
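To make "for the entire matrix and for any subsets" concrete, here is a small sketch of one thresholded metric evaluated with and without a subset. It is not the evaluator's code: the function names are hypothetical, only precision at the top k is shown, and tie-breaking between equal scores is ignored.

```python
def precision_at_top_k(scores, labels, k):
    """Fraction of true labels among the k highest-scored rows."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def evaluate(entity_ids, scores, labels, k, subset=None):
    """Evaluate over the whole matrix (subset=None) or only the entities in `subset`."""
    keep = [i for i, entity in enumerate(entity_ids) if subset is None or entity in subset]
    return precision_at_top_k([scores[i] for i in keep], [labels[i] for i in keep], k)


entity_ids = [1, 2, 3, 4]
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]

whole_matrix = evaluate(entity_ids, scores, labels, k=2)
subset_only = evaluate(entity_ids, scores, labels, k=2, subset={1, 3})
```

The same predictions give different metric values depending on the subset, which is why each metric is stored once per subset.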
The Individual Importance Calculator generates the top n feature importances for each entity in a given model.
Input
- individual_importance_config in experiment config
- model id
- a MatrixStore object for a test matrix
- an as-of-date
Output
- rows in the test_results.individual_importances table for the model, date, and matrix based on the configured method and number of top features per entity.
The Experiment class is designed to have all work done by component objects that reside as attributes on the instance. The purpose of this is to maximize the reuse potential of the components outside of the Experiment, as well as avoid excessive class inheritance within the Experiment.
The inheritance tree of the Experiment is reserved for execution concerns, such as switching between single-threaded, multiprocess, or cluster execution. To enable these different execution contexts without excessive duplicated code, the components that cover computationally or memory-intensive work generally implement methods to generate a collection of serializable tasks to perform later, on either that same object or perhaps another one running in another process or machine. The subclasses of Experiment then differentiate themselves by implementing methods to execute a collection of these tasks using their preferred method of execution, whether it be a simple loop, a process pool, or a cluster.
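The division of labor described here (components plan and process tasks, subclasses decide how to execute them) can be sketched with toy classes. All names below are hypothetical, and a thread pool is used in place of a process pool or cluster only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


class SquareComponent:
    """Hypothetical component: plans serializable tasks, then runs them one at a time."""

    def generate_tasks(self, values):
        # Each task is a plain dict of keyword arguments for process_task.
        return [{"value": value} for value in values]

    def process_task(self, value):
        return value * value


class SketchExperiment:
    """Base class owns the workflow; subclasses only decide how tasks are executed."""

    def __init__(self):
        self.component = SquareComponent()

    def run(self, values):
        tasks = self.component.generate_tasks(values)
        return self.process_tasks(tasks)

    def process_tasks(self, tasks):
        raise NotImplementedError


class SerialSketchExperiment(SketchExperiment):
    def process_tasks(self, tasks):
        return [self.component.process_task(**task) for task in tasks]


class PooledSketchExperiment(SketchExperiment):
    def process_tasks(self, tasks):
        with ThreadPoolExecutor(max_workers=2) as pool:
            return list(pool.map(lambda task: self.component.process_task(**task), tasks))
```

Both subclasses produce the same results for the same input; only the execution strategy differs, and no component code had to change to support the second one.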
The components are created and experiment configuration is bound to them at Experiment construction time, so that the instance methods can have concise call signatures that only cover the information passed by other components mid-experiment.
Data reuse/replacement is handled within components. The Experiment generally just hands the replace flag to each component at object construction, and at runtime each component uses that and determines whether or not the needed work has already been done.
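A minimal sketch of that pattern, assuming a set of table names stands in for checking the database; the class and its attributes are hypothetical.

```python
class CachedTableBuilder:
    """Hypothetical component that skips work it has already done unless told to replace."""

    def __init__(self, replace, existing_tables):
        self.replace = replace
        self.existing_tables = existing_tables  # stand-in for querying the database
        self.built = []

    def build(self, table_name):
        if table_name in self.existing_tables and not self.replace:
            return False  # reuse what is already there
        self.existing_tables.add(table_name)
        self.built.append(table_name)
        return True


reusing = CachedTableBuilder(replace=False, existing_tables={"cohort"})
replacing = CachedTableBuilder(replace=True, existing_tables={"cohort"})
```

With replace=False the builder leaves the existing cohort table alone and only builds what is missing; with replace=True it rebuilds the table regardless.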
If you're looking to change behavior of the Experiment:
- When possible, the logic resides in one of the components. The component list above should help in finding the boundaries between components.
- Logic that specifically relates to parallel execution is in one of the Experiment subclasses (see the parallelization section below).
- Everything else is in the Experiment base class. This is where the public interface (.run()) resides. It follows a template method pattern to define the skeleton of the Experiment: instantiating components based on experiment configuration and runtime inputs, and passing output from one component to another.
Generally, the experiment configuration is where any new options go that change any data science-related functionality; in other words, if you could conceivably get better precision from the change, it should make it into experiment configuration. This is so the hashed experiment config is meaningful and the experiment can be audited by looking at the experiment configuration rather than requiring the perusal of custom code. The blind spot in this is, of course, the state of the database, which can always change results, but it's useful for database state to continue to be the only exception to this rule.
On the other hand, new options that affect only runtime concerns (e.g. performance boosts) should go as arguments to the Experiment. For instance, changing the number of cores to use for matrix building, or telling it to skip predictions won't change the answer you're looking for; options like these just help you potentially get to the answer faster. Once an experiment is completed, runtime flags like these should be totally safe to ignore in analysis.
Another important part of enabling different execution contexts is being able to pass large, persisted objects (e.g. matrices or models) by reference to another process or cluster. To achieve this, as well as provide the ability to configure different storage mediums (e.g. S3) and formats (e.g. HDF) without changes to the Experiment class, all references to these large objects within any components are handled through an abstraction layer.
All interactions with individual matrices and their bundled metadata are handled through MatrixStore objects. The storage medium is handled through a base Store object that is an attribute of the MatrixStore. The storage format is handled through inheritance on the MatrixStore: each subclass, such as CSVMatrixStore or HDFMatrixStore, implements the necessary methods (save, load, head_of_matrix) to properly persist or load a matrix from its storage.
In addition, the MatrixStore provides a variety of methods to retrieve data from either the base matrix itself or its metadata. For instance (this is not meant to be a complete list):
- matrix - the raw matrix
- metadata - the raw metadata dictionary
- exists - whether or not it exists in storage
- columns - the column list
- labels - the label column
- uuid - the matrix's UUID
- as_of_dates - the matrix's list of as-of-dates
One MatrixStorageEngine exists at the Experiment level, and roughly corresponds with a directory wherever matrices are stored. Its only interface is to provide a MatrixStore object given a matrix UUID.
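The layering (medium as an attribute, format through inheritance, engine as the single entry point) can be sketched with standard-library pieces. Every class below is a hypothetical stand-in prefixed or named to avoid confusion with the real classes, and the method names are assumptions rather than Triage's actual signatures.

```python
import csv
import io
import json


class MemoryStore:
    """Hypothetical storage medium: keeps bytes in a dict instead of on disk or S3."""

    def __init__(self):
        self._blobs = {}

    def exists(self, key):
        return key in self._blobs

    def write(self, key, data):
        self._blobs[key] = data

    def load(self, key):
        return self._blobs[key]


class SketchMatrixStore:
    """Format-agnostic base: subclasses supply only the serialization."""

    suffix = None

    def __init__(self, store, matrix_uuid):
        self.store = store
        self.matrix_uuid = matrix_uuid

    @property
    def exists(self):
        return self.store.exists(self.matrix_uuid + self.suffix)

    def save(self, rows, metadata):
        self.store.write(self.matrix_uuid + self.suffix, self._dump(rows))
        self.store.write(self.matrix_uuid + ".json", json.dumps(metadata).encode("utf-8"))

    @property
    def matrix(self):
        return self._parse(self.store.load(self.matrix_uuid + self.suffix))

    @property
    def metadata(self):
        return json.loads(self.store.load(self.matrix_uuid + ".json"))


class SketchCSVMatrixStore(SketchMatrixStore):
    suffix = ".csv"

    def _dump(self, rows):
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue().encode("utf-8")

    def _parse(self, data):
        # CSV carries no types, so values come back as strings.
        return list(csv.DictReader(io.StringIO(data.decode("utf-8"))))


class SketchMatrixStorageEngine:
    """Single entry point: hands out a matrix store for a given UUID."""

    def __init__(self, store, matrix_store_class):
        self.store = store
        self.matrix_store_class = matrix_store_class

    def get_store(self, matrix_uuid):
        return self.matrix_store_class(self.store, matrix_uuid)


engine = SketchMatrixStorageEngine(MemoryStore(), SketchCSVMatrixStore)
matrix_store = engine.get_store("abc123")
```

Swapping the medium means passing a different store object, and swapping the format means passing a different subclass; code that only holds a matrix UUID and the engine is unaffected by either change.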
Model storage is handled similarly to matrix storage, although the interactions with it are far simpler so there is no single-model class akin to the MatrixStore. One ModelStorageEngine exists at the Experiment level, configured with the Experiment's storage medium, and through it trained models can be saved or loaded. The ModelStorageEngine uses joblib to save and load compressed pickles of the model.
Both the ModelStorageEngine and MatrixStorageEngine are based on a more general storage abstraction that is suitable for any other auxiliary objects (e.g. graph images) that need to be stored. That is the ProjectStorage object, which roughly corresponds to a directory on some storage medium where we store everything. One of these exists as an Experiment attribute, and its interface .get_store can be used to persist or load whatever is needed.
In the Class Design section above, we introduced tasks for parallelization and subclassing for execution changes. This section expands on both as a guide to working with them.
Currently there are three methods that must be implemented by subclasses of Experiment in order to be fully functional.
- process_query_tasks - Run feature generation queries. Receives a list of tasks. Each task actually represents a table and is split into three lists of queries to enable the implementation to avoid deadlocks: prepare (table creation), inserts (a collection of INSERT INTO SELECT queries), and finalize (indexing). prepare needs to be run before the inserts and finalize is best run after the inserts, so it is advised that only the inserts are parallelized. The subclass should run each individual batch of queries by calling self.feature_generator.run_commands([list of queries]), which will run all of the queries serially, so the implementation can send a batch of queries to each worker instead of having each individual query be on a new worker.
- process_matrix_build_tasks - Run matrix build tasks (that assume all the necessary label/cohort/feature tables have been built). Receives a dictionary of tasks. Each key is a matrix UUID, and each value is a dictionary that has all the necessary keyword arguments to call self.matrix_builder.build_matrix to build one matrix.
- process_train_test_batches - Run model train/test task batches (that assume all matrices are built). Receives a list of triage.component.catwalk.TaskBatch objects, each of which has a list of tasks, a description of those tasks, and whether or not that batch is safe to run in parallel. Within this, each task is a dictionary that has all the necessary keyword arguments to call self.model_train_tester.process_task to train and test one model. Each task covers model training, prediction (on both test and train matrices), model evaluation (on both test and train matrices), and saving of global and individual feature importances.
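The ordering advice for process_query_tasks (prepare first, inserts in parallel batches, finalize last) can be sketched as below. This is an illustration under assumptions: it is written as a free function taking a `run_commands` callable rather than a method using self.feature_generator.run_commands, each task is assumed to be a dictionary with prepare, inserts, and finalize keys, and a thread pool with a recording stub replaces real workers and a real database.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def process_query_tasks(run_commands, query_tasks, n_workers=2):
    """Hypothetical implementation: serial prepare, parallel inserts, serial finalize."""
    for task in query_tasks:
        run_commands(task["prepare"])
        # Group inserts into one batch per worker so each worker runs its
        # batch serially instead of receiving one query at a time.
        batches = [task["inserts"][i::n_workers] for i in range(n_workers)]
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            list(pool.map(run_commands, [batch for batch in batches if batch]))
        # The pool has shut down here, so every insert finished before indexing.
        run_commands(task["finalize"])


executed = []
lock = threading.Lock()


def fake_run_commands(queries):
    # Stand-in for the real run_commands: records queries instead of executing them.
    with lock:
        executed.extend(queries)


query_tasks = [{
    "prepare": ["CREATE TABLE t (entity_id INT)"],
    "inserts": ["INSERT 1", "INSERT 2", "INSERT 3"],
    "finalize": ["CREATE INDEX ON t (entity_id)"],
}]
process_query_tasks(fake_run_commands, query_tasks)
```

The inserts may land in any order relative to each other, but table creation always comes first and indexing always comes last.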
- SingleThreadedExperiment is a barebones implementation that runs everything serially.
- MultiCoreExperiment utilizes local multiprocessing to run tasks through a worker pool. Reading this is helpful to see the minimal implementation needed for some parallelization.
- RQExperiment utilizes an RQ worker cluster to allow the tasks to be parallelized either locally or distributed to other machines. It does not take care of spawning a cluster or any other infrastructural concerns: it expects that the cluster is running somewhere and is reading from the same Redis instance that is passed to the RQExperiment. The RQExperiment simply enqueues tasks and waits for them to be completed. Reading this is helpful as a simple example of how to enable distributed computing.