This guide is meant to familiarize people who have some experience in data science and Python with the inner workings of a Triage Experiment. A Triage Experiment is a highly structured way of defining the experimentation phase of a data science project. If you are wondering whether this Experiment structure is flexible enough to fit your needs, this guide should help you decide.
First, the given temporal_config section in the experiment definition is transformed into train and test splits, including as_of_times for each matrix.
We create these splits by figuring out the latest reasonable split time from the inputs, and moving backwards in time at the rate of the given model_update_frequency, until we get to the earliest reasonable split time.
For each split, we create as_of_times by moving either backwards from the split time towards the max_training_history (for train matrices) or forwards from the split time towards the test_duration (for test matrices) at the provided data_frequency.
Many of these configured values may be lists, in which case we generate the cross-product of all the possible values and generate more splits.
For a more detailed look at the temporal validation logic, see Temporal Validation Deep Dive.
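The backwards and forwards walks described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Triage's actual implementation: the function names are hypothetical, every frequency is a fixed timedelta, and details such as label timespans are ignored.

```python
from datetime import datetime, timedelta


def split_times(earliest, latest, model_update_frequency):
    """Walk backwards from the latest split time to the earliest one."""
    times = []
    current = latest
    while current >= earliest:
        times.append(current)
        current -= model_update_frequency
    return times


def as_of_times_for_train(split_time, max_training_history, data_frequency):
    """Step backwards from the split time, no further than the training history."""
    limit = split_time - max_training_history
    times = []
    current = split_time - data_frequency
    while current >= limit:
        times.append(current)
        current -= data_frequency
    return times


def as_of_times_for_test(split_time, test_duration, data_frequency):
    """Step forwards from the split time until the test duration is used up."""
    limit = split_time + test_duration
    times = []
    current = split_time
    while current < limit:
        times.append(current)
        current += data_frequency
    return times


splits = split_times(datetime(2015, 1, 1), datetime(2016, 1, 1), timedelta(days=365))
for split in splits:
    train = as_of_times_for_train(split, timedelta(days=90), timedelta(days=30))
    test = as_of_times_for_test(split, timedelta(days=60), timedelta(days=30))
    print(split.date(), len(train), len(test))
```

Whether the split time itself belongs to the train or test side, and how label timespans shift the boundaries, are choices this sketch makes arbitrarily.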
The train and test splits themselves are not used until the Building Matrices section, but a flat list of all computed as_of_times for all matrices needed in the experiment is used in the next section, Transforming Data.
With all of the as_of_times for this Experiment computed, it's now possible to transform the input data into features and labels as of all the required times.
The Experiment populates a 'labels' table using the following input:
- A query, provided by the user in the configuration file, that generates entity_ids and outcomes for a given as_of_date and label_timespan.
- Each as_of_date and label_timespan defined in the temporal config
For instance, an inspections-style query (for the given timespan, return the entity and outcome of any matching inspections) would look like:
    select
        events.entity_id,
        bool_or(outcome::bool)::integer as outcome
    from events
    where '{as_of_date}' <= outcome_date
        and outcome_date < '{as_of_date}'::timestamp + interval '{label_timespan}'
    group by entity_id
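The query above is a template: the placeholders are filled in once for every date and timespan combination. A minimal sketch of that expansion, assuming plain Python string formatting (the function name render_label_queries and the example dates are hypothetical, and Triage's own rendering may differ):

```python
from itertools import product

LABEL_QUERY = """
select
    events.entity_id,
    bool_or(outcome::bool)::integer as outcome
from events
where '{as_of_date}' <= outcome_date
    and outcome_date < '{as_of_date}'::timestamp + interval '{label_timespan}'
group by entity_id
"""


def render_label_queries(query, as_of_dates, label_timespans):
    """Return one (as_of_date, label_timespan, sql) tuple per combination."""
    return [
        (as_of_date, timespan,
         query.format(as_of_date=as_of_date, label_timespan=timespan))
        for as_of_date, timespan in product(as_of_dates, label_timespans)
    ]


rendered = render_label_queries(
    LABEL_QUERY,
    as_of_dates=["2016-01-01", "2017-01-01"],
    label_timespans=["6 months", "1 year"],
)
print(len(rendered))    # one query per date/timespan combination
print(rendered[0][2])   # the first rendered query
```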
This binary labels table is scoped to the entire Experiment, so all as_of_time (computed in step 1) and label_timespan (taken straight from temporal_config) combinations are present. Additionally, the 'label_name' and 'label_type' are also recorded with each row in the table.
The name of the labels table is based on both the name of the label and a hash of the label query (e.g. labels_failedviolation_a0b1c2d3), so any prior experiments that shared both the name and query will be able to reuse the labels table. If the 'replace' flag was not sent, for each as_of_time and label_timespan, the labels table is queried to check if any rows exist that match. If any such rows exist, the labels query for that date and timespan is not run.
At this point, the 'labels' table may not have entries for all entities and dates that need to be in a given matrix. How these rows have their labels represented is up to the configured include_missing_labels_in_train_as value in the experiment. This value is not processed when we generate the labels table, but later on when the matrix is built (see 'Retrieving Data and Saving Completed Matrix').
The Experiment keeps track of which entities are in the cohort on any given date. Similarly to the labels table, the experiment populates a cohort table using one of two options:
- A query, provided by the user in the configuration file, that generates entity_ids for a given as_of_date. This is run for each as_of_date as generated from the temporal config.
- All distinct entity ids and as_of_dates in the labels table, if no query is provided by the user in the configuration file.
This cohort table is scoped to the entire Experiment, so all as_of_times (computed in step 1) are present.
The name of the cohort table is based on both the name of the cohort and a hash of the cohort query (e.g. cohort_permitted_a0b1c2d3), so any prior experiments that shared both the name and query will be able to reuse the cohort table. If the 'replace' flag was not sent, for each as_of_time, the cohort table is queried to check if any rows exist that match. If any such rows exist, the cohort query for that date is not run.
Each provided feature_aggregation configures the creation and population of several feature tables in the 'features' schema: one for each of the groups specified in the config, one that merges the groups together into one table, and one that fills in null values from the merged table with imputed values based on imputation config.
To generate the SQL that creates the pre-imputation table, the Experiment assembles building blocks from the feature aggregation config, as well as the experiment's list of as_of_times:
- from_obj represents, well, the object of the FROM clause in the SQL query. Often this is just a table, but can be configured to be a subquery. This holds all the data that we want to aggregate into features.
- Each as_of_time in the experiment and interval in the feature_aggregation is combined with the knowledge_date_column to create a WHERE clause representing a valid window of events to aggregate in the from_obj: e.g. (where {knowledge_date_column} >= {as_of_time} - interval {interval})
- Each aggregate, categorical, or array_categorical represents a SELECT clause. For aggregates, the quantity is a column or SQL expression representing a numeric quantity present in the from_obj, and the metrics are any number of aggregate functions we want to use. The aggregate function is applied to the quantity.
- By default the query is joined with the cohort table to remove unnecessary rows. If features_ignore_cohort is passed to the Experiment this is not done.
So a simplified version of a typical query would look like:
    SELECT {group}, {metric}({quantity})
    FROM {from_obj}
    JOIN {cohort_table} ON (
        {cohort_table.entity_id} = {from_obj.entity_id}
        AND {cohort_table.date} = {as_of_time}
    )
    WHERE {knowledge_date_column} >= {as_of_time} - interval {interval}
    GROUP BY entity_id
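To make the assembly of these building blocks concrete, here is a minimal sketch that fills the simplified template with values. It is not the SQL generator Triage uses: build_feature_query is a hypothetical helper, from_obj is assumed to be a plain table name rather than a subquery, and the cohort table's date column is assumed to be called as_of_date.

```python
def build_feature_query(from_obj, cohort_table, knowledge_date_column,
                        as_of_time, interval, metrics, quantity):
    """Assemble a simplified aggregation query for one as_of_time and interval."""
    # One SELECT expression per aggregate function applied to the quantity.
    select_parts = [f"{from_obj}.entity_id"]
    select_parts += [f"{metric}({quantity})" for metric in metrics]
    return (
        f"SELECT {', '.join(select_parts)}\n"
        f"FROM {from_obj}\n"
        f"JOIN {cohort_table} ON (\n"
        f"    {cohort_table}.entity_id = {from_obj}.entity_id\n"
        f"    AND {cohort_table}.as_of_date = '{as_of_time}'\n"
        f")\n"
        f"WHERE {knowledge_date_column} >= "
        f"'{as_of_time}'::timestamp - interval '{interval}'\n"
        f"GROUP BY {from_obj}.entity_id"
    )


sql = build_feature_query(
    from_obj="events",
    cohort_table="cohort_permitted_a0b1c2d3",
    knowledge_date_column="event_date",
    as_of_time="2016-01-01",
    interval="1 year",
    metrics=["count", "avg"],
    quantity="amount",
)
print(sql)
```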
For each as_of_time, the results from the generated query are written to a table whose name is prefixed with the prefix, and suffixed with the group (always entity_id in triage). The table is keyed on this grouping column plus the as_of_date.
Each generated group table is combined into one representing the whole aggregation with a left join to ensure all valid entity/date pairs are included (allowing for identification of nulls requiring imputation). This aggregation-level table represents all of the features in the aggregation, pre-imputation. Its output location is generally {prefix}_aggregation.
A table that looks similar, but with imputed values, is then created. The cohort table from above is passed into collate as the comprehensive set of entities and dates for which output should be generated, regardless of whether they exist in the from_obj. Each feature column has an imputation rule, inherited from some level of the feature definition. The imputation rules that are based on data (e.g. mean) use the rows from the as_of_time to produce the imputed value.
In addition, each column that needs imputation has an imputation flag column created, which contains a boolean flagging which rows were imputed or not. Since the values of these columns are redundant for most aggregate functions that look at a given timespan's worth of data (they will be imputed only if zero events in their timespan are seen), only one imputation flag column per timespan is created. An exception to this are some statistical functions that require not one, but two values, like standard deviation and variance. These boolean imputation flags are not merged in with the others.
Its output location is generally {prefix}_aggregation_imputed
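As an illustration of a data-based imputation rule with its flag column, the sketch below fills nulls with the mean of the observed values from the same as_of_date. The helper name, the "_imp" flag suffix, and the fallback of 0.0 when a date has no observed values are all assumptions made for this example, not a description of collate's behavior.

```python
from collections import defaultdict


def impute_mean_by_date(rows, column):
    """Fill None values in `column` with the mean of that as_of_date's rows."""
    observed = defaultdict(list)
    for row in rows:
        if row[column] is not None:
            observed[row["as_of_date"]].append(row[column])

    imputed_rows = []
    for row in rows:
        was_null = row[column] is None
        if was_null:
            values = observed[row["as_of_date"]]
            # Assumed fallback: 0.0 when nothing was observed on this date.
            value = sum(values) / len(values) if values else 0.0
        else:
            value = row[column]
        # The flag column records which rows were imputed.
        imputed_rows.append({**row, column: value, column + "_imp": was_null})
    return imputed_rows


rows = [
    {"entity_id": 1, "as_of_date": "2016-01-01", "x": 2.0},
    {"entity_id": 2, "as_of_date": "2016-01-01", "x": None},
    {"entity_id": 3, "as_of_date": "2016-01-01", "x": 4.0},
    {"entity_id": 1, "as_of_date": "2017-01-01", "x": 10.0},
    {"entity_id": 2, "as_of_date": "2017-01-01", "x": None},
]
for row in impute_mean_by_date(rows, "x"):
    print(row)
```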
At this point, we have at least three tables that are used to populate matrices:
- labels_{labelname}_{labelqueryhash} with computed labels for each date
- cohort_{cohortname}_{cohortqueryhash} with the cohort for each date
- A features.{prefix}_aggregation_imputed table for each feature aggregation present in the experiment config.
At this point, we have to build actual train and test matrices that can be processed by machine learning algorithms, and save them at the user's specified path, either on the local filesystem or s3 depending on the scheme portion of the path (e.g. s3://bucket-name/project_directory).
First we have to figure out exactly what matrices we have to build. The split definitions from step 1 are a good start -- they are our train and test splits -- but sometimes we also want to test different subsets of the data, like feature groups (e.g. 'how does using group of features A perform against using all features?'). So there's a layer of iteration we introduce for each split, that may produce many more matrices.
What do we iterate over?
- Feature List - All subsets of features that the user wants to cycle through. This is the end result of the feature group generation and mixing process, which is described more below.
- Cohorts - In theory we can take in different cohorts and iterate in the same experiment. This is not fully implemented, so in reality we just use the one cohort that is passed in the cohort_config
- Label names - In theory we can take in different labels (e.g. complaints, sustained complaints) in the same experiment. Right now there is no support for multiple label names, but the label name used is configurable through the optional 'label_config'->'name' config value
- Label types - In theory we can take in different label types (e.g. binary) in the same experiment. Right now this isn't done, there is one label type and it is hardcoded as 'binary'.
How do we arrive at the feature lists? There are two pieces of config that are used: feature_group_definition and feature_group_strategies. Feature group definitions are just ways to define logical blocks of features, most often features that come from the same source, or describing a particular type of event. These groups are represented within the experiment as a list of feature names, representing some subset of all potential features for the experiment. Feature group strategies are ways to take feature groups and mix them together in various ways. The feature group strategies take these subsets of features and convert them into another list of subsets of features, which is the final list iterated over to create different matrices.
Feature groups, at present, can be defined as either a prefix (the prefix of the feature name), a table (the feature table that the feature resides in), or all (all features). Each argument is passed as a list, and each entry in the list is interpreted as a group. So, a feature group config of {'table': ['complaints_aggregate_imputed', 'incidents_aggregate_imputed']} would result in two feature groups: one with all the features in complaints_aggregate_imputed, and one with all the features in incidents_aggregate_imputed. Note that this requires a bit of knowledge on the user's part of how the feature table names will be constructed.
prefix works on the prefix of the feature name as it exists in the database, so this also requires some knowledge of how these names get created. The general format is: {aggregation_prefix}_{group}_{timeperiod}_{quantity}, so with some knowledge the user can create groups with the aggregation's configured prefix (common), or the aggregation's configured prefix + group (in case they want to compare, for instance, zip-code level features versus entity level features).
all, with a single value of True, will include a feature group with all defined features. If no feature group definition is sent, this is the default.
Either way, at the end of this process the experiment will be aware of some list of feature groups, even if the list is just length 1 with all features as one group.
A few basic feature group mixing strategies are implemented: leave-one-in, leave-one-out, and all. These are sent in the experiment definition as a list, so different strategies can be tried in the same experiment. Each included strategy will be applied to the list of feature groups from the previous step, to convert them into lists of features.
For instance, 'leave-one-in' will cycle through each feature group, and for each one create a list of features that just represents that feature group, so for some matrices we would only use features from that particular group. leave-one-out does the opposite, for each feature group creating a list of features that includes all other feature groups but that one. all just creates a list of features that represents all feature groups together.
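The three strategies are simple enough to sketch directly. The functions below are an illustrative reimplementation working on plain lists of feature names, not Triage's own mixer code; the function names and the example feature names are made up.

```python
def leave_one_in(groups):
    """One feature list per group, containing only that group's features."""
    return [list(group) for group in groups]


def leave_one_out(groups):
    """One feature list per group, containing every other group's features."""
    return [
        [feature for j, other in enumerate(groups) if j != i for feature in other]
        for i in range(len(groups))
    ]


def all_groups(groups):
    """A single feature list containing the features of every group."""
    return [[feature for group in groups for feature in group]]


STRATEGIES = {
    "leave-one-in": leave_one_in,
    "leave-one-out": leave_one_out,
    "all": all_groups,
}


def mix(groups, strategy_names):
    """Apply each requested strategy in turn and collect the feature lists."""
    feature_lists = []
    for name in strategy_names:
        feature_lists.extend(STRATEGIES[name](groups))
    return feature_lists


groups = [["complaints_count", "complaints_avg"], ["incidents_count"], ["arrests_count"]]
for feature_list in mix(groups, ["leave-one-out", "all"]):
    print(feature_list)
```

Each feature list produced this way becomes one more dimension to iterate over when matrices are built.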
At this point, matrices are created by looping through all train/test splits and data subsets (e.g. feature groups, state definitions), grabbing the data corresponding to each from the database, and assembling that data into a design matrix that is saved along with the metadata that defines it.
As an example, if the experiment defines 3 train/test splits (one test per train in this example, for simplicity), 3 feature groups that are mixed using the 'leave-one-out' and 'all' strategies, and 1 state definition, we'll expect 24 matrices to be saved: the two strategies yield 4 feature lists (3 from 'leave-one-out' plus 1 from 'all'), multiplying the 3 time splits by these 4 feature lists gives 12 combinations, and each one creates a train and a test matrix.
After all matrices for the Experiment are defined but before any are built, the Experiment is associated with each Matrix in the database through the triage_metadata.experiment_matrices table. This means that whether or not the Experiment has to end up building a matrix, after the fact a user can query the database to see if it used said matrix.
Each matrix that has to be built (i.e. has not been built by some prior experiment) is built by retrieving its data out of the database.
How do we get the data for an individual matrix out of the database?
- Create an entity-date table for this specific matrix. There is some logic applied to decide what rows show up. There are two possible sets of rows that could show up.
    - all valid entity dates. These dates come from the entity-date-state table for the experiment (populated using the rules defined in the 'cohort_config'), filtered down to the entity-date pairs that match both the state filter and the list of as-of-dates for this matrix.
    - all labeled entity dates. These dates consist of all the valid entity dates from above that also have an entry in the labels table.

  If the matrix is a test matrix, all valid entity dates will be present.
  If the matrix is a train matrix, whether or not valid but unlabeled examples show up is decided by the include_missing_labels_in_train_as configuration value. If it is present in any form, these unlabeled rows will be in the matrix. Otherwise, they will be filtered out.
- Write features data from tables to disk in CSV format using a COPY command, table by table. Each table is joined with the matrix-specific entity-date table to only include the desired rows.
- Write labels data to disk in CSV format using a COPY command. These labels will consist of the rows in the matrix-specific entity-date table left joined to the labels table. Rows not present in the labels table will have their label filled in (either True or False) based on the value of the include_missing_labels_in_train_as configuration key.
- Merge the features and labels CSV files horizontally, in pandas. They are expected to be of the same shape, which is enforced by the entity-date table. The resulting matrix is indexed on entity_id and as_of_date, and then saved to disk (in CSV format, more formats to come) along with its metadata: time, feature, label, index, and state information, along with any user metadata the experiment config specified. The filename is decided by a hash of this metadata, and the metadata is saved in a YAML file with the same hash and directory. The metadata is additionally added to a database table 'matrices'.
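The row-selection rules in the first step can be summarized in a small sketch. It works on in-memory Python structures instead of database tables, and the function name and argument shapes are hypothetical; it is meant only to show how the matrix type and include_missing_labels_in_train_as interact.

```python
def matrix_labels(valid_entity_dates, labels, matrix_type,
                  include_missing_labels_in_train_as=None):
    """Return {(entity_id, as_of_date): label} for the rows kept in a matrix.

    valid_entity_dates: list of (entity_id, as_of_date) pairs in the cohort.
    labels: dict mapping some of those pairs to a label.
    """
    rows = {}
    for pair in valid_entity_dates:
        if pair in labels:
            rows[pair] = labels[pair]
        elif matrix_type == "test":
            # Test matrices keep every valid entity date, labeled or not.
            rows[pair] = None
        elif include_missing_labels_in_train_as is not None:
            # Train matrices keep unlabeled rows only when a fill value is configured.
            rows[pair] = include_missing_labels_in_train_as
        # Otherwise the unlabeled train row is filtered out.
    return rows


valid = [(1, "2016-01-01"), (2, "2016-01-01")]
labels = {(1, "2016-01-01"): 1}

print(matrix_labels(valid, labels, "train"))
print(matrix_labels(valid, labels, "train", include_missing_labels_in_train_as=False))
print(matrix_labels(valid, labels, "test"))
```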
Matrix metadata reference:
At this point, all finished matrices and metadata will be saved under the project_path supplied by the user to the Experiment constructor, in the subdirectory matrices.
The last phase of an Experiment run uses the completed design matrices to train, test, and evaluate classifiers. This procedure writes a lot of metadata to the 3 schemas: 'triage_metadata', 'train_results', and 'test_results'.
Every combination of training matrix + classifier + hyperparameter is considered a Model. Before any Models are trained, the Experiment is associated with each Model in the database through the triage_metadata.experiment_models table. This means that whether or not the Experiment has to end up training a model, after the fact a user can query the database to see if it used said model.
Each matrix marked for training is sent through the configured grid in the experiment's grid_config. This works much like the scikit-learn ParameterGrid (and in fact uses it on the backend). It cycles through all of the classifiers and hyperparameter combinations contained therein, and calls .fit() with that train matrix. Any classifier that adheres to the scikit-learn .fit/.predict_proba interface and is available in the Python environment will work here, whether it is a standard scikit-learn classifier, a third-party library like XGBoost, or a custom-built one in the calling repository (for instance, one that implements the problem domain's baseline heuristic algorithm for comparison). Metadata about the trained classifier is written to the triage_metadata.models Postgres table. The trained model is saved to a filename with the model hash (see Model Hash section below).
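To show what cycling through the grid means, the sketch below expands a grid into individual classifier and hyperparameter combinations using only the standard library. It mimics the kind of expansion ParameterGrid performs instead of calling it, and the shape of the grid dictionary (class path mapped to lists of hyperparameter values) is an assumption made for this example.

```python
from itertools import product


def expand_grid(grid_config):
    """Yield (class_path, hyperparameters) for every combination in the grid."""
    for class_path, param_grid in grid_config.items():
        names = sorted(param_grid)
        # The cross-product of all value lists gives one dict per combination.
        for values in product(*(param_grid[name] for name in names)):
            yield class_path, dict(zip(names, values))


grid_config = {
    "sklearn.tree.DecisionTreeClassifier": {
        "max_depth": [1, 5, None],
        "min_samples_split": [2, 10],
    },
    "sklearn.linear_model.LogisticRegression": {
        "C": [0.1, 1.0],
    },
}

combinations = list(expand_grid(grid_config))
for class_path, hyperparameters in combinations:
    print(class_path, hyperparameters)
```

Every combination, paired with a train matrix, corresponds to one Model in the sense defined above.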
Each model is assigned a 'model group'. A model group represents a number of trained classifiers that we want to treat as equivalent by some criteria. By default, this is aimed at defining models which are equivalent across time splits, to make analyzing model stability easier. This default is accomplished with a set of 'model group keys' that includes data about the classifier (module, hyperparameters), temporal intervals used to create the train matrix (label timespan, training history, as-of-date frequency), and metadata describing the data in the train matrix (features and feature groups, label name, cohort name). The user can override this set of model_group_keys in the experiment definition, with all of the default information plus other matrix metadata at their disposal (See end of 'Retrieving Data and Saving Completed Matrix' section for more about matrix metadata). This data is stored in the triage_metadata.model_groups table, along with a model_group_id that is used as a foreign key in the triage_metadata.models table.
Each trained model is assigned a hash, for the purpose of uniquely defining and caching the model. This hash is based on the training matrix metadata, classifier path, hyperparameters (except those which concern execution and do not affect results of the classifier, such as n_jobs), and the given project path for the Experiment. This hash can be found in each row of the triage_metadata.models table. It is enforced as a unique key in the table.
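A hash with these properties could be computed along the following lines. This is a sketch under assumptions, not the hashing code Triage uses: the serialization (sorted JSON), the digest (SHA-256), and the set of execution-only parameter names are all choices made for illustration.

```python
import hashlib
import json

# Hypothetical set of hyperparameters that affect execution but not results.
EXECUTION_ONLY_PARAMS = {"n_jobs"}


def model_hash(matrix_metadata, class_path, hyperparameters, project_path):
    """Hash everything that defines a model, ignoring execution-only settings."""
    unique = {
        "matrix_metadata": matrix_metadata,
        "class_path": class_path,
        "hyperparameters": {
            name: value
            for name, value in hyperparameters.items()
            if name not in EXECUTION_ONLY_PARAMS
        },
        "project_path": project_path,
    }
    # sort_keys makes the serialization independent of dict ordering.
    encoded = json.dumps(unique, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


metadata = {"label_name": "failedviolation", "feature_names": ["f1", "f2"]}
path = "sklearn.ensemble.RandomForestClassifier"

a = model_hash(metadata, path, {"n_estimators": 100, "n_jobs": 1}, "/tmp/project")
b = model_hash(metadata, path, {"n_estimators": 100, "n_jobs": 8}, "/tmp/project")
c = model_hash(metadata, path, {"n_estimators": 500, "n_jobs": 1}, "/tmp/project")
print(a == b, a == c)
```

Because n_jobs is excluded, rerunning with a different level of parallelism maps to the same cached model, while any change to a result-affecting hyperparameter produces a new hash.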
The training phase also writes global feature importances to the database, in the train_results.feature_importances table.
A few methods are queried to attempt to compute feature importances:
- The bulk of these are computed using the trained model's .feature_importances_ attribute, if it exists.
- For sklearn's SVC models with a linear kernel, the model's .coef_.squeeze() is used.
- For sklearn's LogisticRegression models, np.exp(model.coef_).squeeze() is used.
- Otherwise, no feature importances are written.
For each test matrix, predictions, individual importances, and the user-specified testing evaluation metrics are written to the 'test_results' schema. For each train matrix, predictions and the user-specified training evaluation metrics are written to the 'train_results' schema.
The trained model's prediction probabilities (predict_proba()) are computed both for the matrix it was trained on and any testing matrices. The predictions for the training matrix are saved in train_results.predictions and those for the testing matrices are saved in test_results.predictions. More specifically, predict_proba returns the probabilities for each label (false and true), but in this case only the probabilities for the true label are saved in the {train or test}_results.predictions table. The entity_id and as_of_date are retrieved from the matrix's index, and stored in the database table along with the probability score, label value (if it has one), as well as other metadata.
Feature importances (of a configurable number of top features, defaulting to 5) for each prediction are computed and written to the test_results.individual_importances table. Right now, there are no sophisticated calculation methods integrated into the experiment; simply the top 5 global feature importances for the model are copied to the individual_importances table.
Triage allows for the computation of both testing set and training set evaluation metrics. Evaluation metrics, such as precision and recall at various thresholds, are written to either the train_results.evaluations table or the test_results.evaluations table. Triage defines a number of evaluation metrics that can be addressed by name in the experiment definition, along with a list of thresholds and/or other parameters (such as the 'beta' value for fbeta) to iterate through.
Thresholding is done either via absolute value (top k) or percentile by sorting the predictions and labels by the row's predicted probability score, with ties broken in some way (see next paragraph), and assigning the predicted value as True for those above the threshold. Note that the percentile thresholds are in terms of the population percentage, not a cutoff threshold for the predicted probability.
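The difference between a population-based threshold and a score cutoff is easy to see in a small sketch. The helpers below are illustrative only; in particular, truncating the percentile to a whole number of rows is an assumption about rounding, and tie-breaking is ignored here.

```python
def top_k_predictions(scores, k):
    """Mark the k highest-scoring rows as predicted positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    above = set(order[:k])
    return [i in above for i in range(len(scores))]


def top_percent_predictions(scores, percent):
    """Mark the top `percent` of the population as predicted positive."""
    # Assumed rounding: truncate to a whole number of rows.
    k = int(len(scores) * percent / 100)
    return top_k_predictions(scores, k)


scores = [0.9, 0.1, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.5, 0.05]

print(top_k_predictions(scores, 2))
print(top_percent_predictions(scores, 30))
```

With ten rows, a 30 percent threshold selects the three highest-scoring rows (0.9, 0.8, and 0.7). Rows scoring 0.6, 0.5, and 0.4 are left out even though their scores exceed 0.3, because the percentage refers to the population and not to the score.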
A few different versions of tiebreaking are implemented to deal with the nuances of thresholding, and each result is written to the evaluations table for each metric score, along with some related statistics:
- worst_value - Ordering by the label ascending. This has the effect of as many predicted negatives making it above thresholds as possible, thus producing the worst possible score.
- best_value - Ordering by the label descending. This has the effect of as many predicted positives making it above thresholds as possible, thus producing the best possible score.
- stochastic_value - If the worst_value and best_value are not the same (as defined by the floating point tolerance at catwalk.evaluation.RELATIVE_TOLERANCE), the sorting/thresholding/evaluation will be redone many times, and the mean of all these trials is written to this column. Otherwise, the worst_value is written here.
- num_sort_trials - If trials are needed to produce the stochastic_value, the number of trials taken is written here. Otherwise this will be 0.
- standard_deviation - If trials are needed to produce the stochastic_value, the standard deviation of these trials is written here. Otherwise this will be 0.
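The following sketch shows how these columns could be produced for precision at the top k rows. It is an illustration, not catwalk's evaluation code: the tolerance value, the number of trials, the use of a population standard deviation, and the function names are all assumptions.

```python
import math
import random
import statistics

RELATIVE_TOLERANCE = 0.01  # placeholder value chosen for this example


def precision_at_k(rows, k):
    """Fraction of positive labels among the first k (score, label) rows."""
    return sum(label for _, label in rows[:k]) / k


def evaluate_with_tiebreaking(scores, labels, k, trials=30, seed=0):
    rows = list(zip(scores, labels))
    # Score descending; ties ordered by label ascending (worst) or descending (best).
    worst = precision_at_k(sorted(rows, key=lambda r: (-r[0], r[1])), k)
    best = precision_at_k(sorted(rows, key=lambda r: (-r[0], -r[1])), k)
    result = {"worst_value": worst, "best_value": best}

    if math.isclose(worst, best, rel_tol=RELATIVE_TOLERANCE):
        # Ties do not matter at this threshold, so no trials are needed.
        result.update(stochastic_value=worst, num_sort_trials=0, standard_deviation=0)
        return result

    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        shuffled = rows[:]
        rng.shuffle(shuffled)                # randomize the order of tied rows
        shuffled.sort(key=lambda r: -r[0])   # stable sort keeps that random order
        values.append(precision_at_k(shuffled, k))
    result.update(
        stochastic_value=statistics.mean(values),
        num_sort_trials=trials,
        standard_deviation=statistics.pstdev(values),
    )
    return result


tied = evaluate_with_tiebreaking([0.9, 0.5, 0.5, 0.5, 0.1], [1, 1, 0, 0, 1], k=2)
untied = evaluate_with_tiebreaking([0.9, 0.8, 0.1], [1, 0, 1], k=2)
print(tied)
print(untied)
```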
Sometimes test matrices may not have labels for every row, so it's worth mentioning here how that is handled and interacts with thresholding. Rows with missing labels are not considered in the metric calculations, and if some of these rows are in the top k of the test matrix, no more rows are taken from the rest of the list for consideration. So if the experiment is calculating precision at the top 100 rows, and 40 of the top 100 rows are missing a label, the precision will actually be calculated on the 60 of the top 100 rows that do have a label. To make the results of this more transparent for users, a few extra pieces of metadata are written to the evaluations table for each metric score.
- num_labeled_examples - The number of rows in the test matrix that have labels
- num_labeled_above_threshold - The number of rows above the configured threshold for this metric score that have labels
- num_positive_labels - The number of positive labels in the test matrix
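The top-100 example above can be reproduced with a short sketch. The function and the synthetic data are hypothetical and serve only to show how the three metadata values relate to the metric; missing labels are represented as None.

```python
def precision_at_top_k_with_missing(scores, labels, k):
    """Precision over the labeled rows within the top k, plus related counts."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Unlabeled rows in the top k are dropped and not replaced by lower rows.
    labeled_top = [labels[i] for i in order[:k] if labels[i] is not None]
    return {
        "precision": sum(labeled_top) / len(labeled_top) if labeled_top else None,
        "num_labeled_examples": sum(1 for label in labels if label is not None),
        "num_labeled_above_threshold": len(labeled_top),
        "num_positive_labels": sum(1 for label in labels if label == 1),
    }


# 200 rows with strictly decreasing scores: the top 100 contain 40 unlabeled
# rows, 30 positives, and 30 negatives; the remaining 100 rows are negatives.
scores = [1 - i / 200 for i in range(200)]
labels = [None] * 40 + [1] * 30 + [0] * 30 + [0] * 100

print(precision_at_top_k_with_missing(scores, labels, k=100))
```

Here precision is computed on the 60 labeled rows in the top 100, giving 30 / 60 = 0.5.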
Triage supports performing a bias audit using the Aequitas library, if a bias_audit_config is passed in configuration. This is handled first through creating a 'protected groups' table which retrieves the configured protected group information for each member of the cohort, and the time that this protected group information was first known. This table is named using a hash of the bias audit configuration, so data can be reused across experiments as long as the bias configuration does not change.
A bias audit is performed at metric calculation time for each model that is built, on both the train and test matrices, and each subset. The output is very similar to the evaluations table schema, in that each slice of data that has evaluation metrics generated for it also receives a bias audit. The difference is that thresholds are not borrowed from the evaluation configuration, as aequitas audits are computationally expensive and large threshold grids are common in Triage experiments; the bias audit has its evaluation thresholds configured in the bias_audit_config. All data from the bias audit is saved to either the train_results.aequitas or test_results.aequitas tables.
Triage also supports evaluating a model on a subset of the predictions made. This is done by passing a subset query in the prediction config. The model evaluator will then subset the predictions on valid entity-date pairs for the given model and will calculate metrics for the subset, re-applying thresholds as necessary to the predictions in the subset. Subset definitions are stored in the triage_metadata.subsets table, and the evaluations are stored in the evaluations tables. A hash of the subset configuration identifies subset evaluations and links them to the subsets table.
At this point, the 'triage_metadata', 'train_results', and 'test_results' database schemas are fully populated with data about models, model groups, predictions, feature importances, and evaluation metrics for the researcher to query. In addition, the trained model pickle files are saved in the configured project path. The experiment is considered finished.