add missingness abstraction, build and evaluate causal model with missingness #136

Closed
wants to merge 74 commits into from

Changes from all commits
74 commits
2e522e2
modeling in progress
rfl-urbaniak Jun 13, 2024
e5e32af
first pass at analysis
rfl-urbaniak Jun 14, 2024
2479e6e
first linear model
rfl-urbaniak Jun 14, 2024
d8ef493
allow null predictor groups in SimpleLinear
rfl-urbaniak Jun 17, 2024
6fd838e
refactoring zoning
rfl-urbaniak Jun 18, 2024
e3459d5
simple linear with tests
rfl-urbaniak Jun 18, 2024
703f8e2
left shape padding of objects_cat_weighted added
rfl-urbaniak Jun 18, 2024
d2653b1
tested on extraneous data
rfl-urbaniak Jun 19, 2024
7af6a82
debugging eval wip
rfl-urbaniak Jun 21, 2024
b339512
evaluation
rfl-urbaniak Jun 21, 2024
e4db7ef
clean zoning data notebook
rfl-urbaniak Jun 21, 2024
4eaaa9a
zoning model
rfl-urbaniak Jun 21, 2024
3bd1365
refactored zoning notebooks
rfl-urbaniak Jun 21, 2024
a5ed053
format lint
rfl-urbaniak Jun 21, 2024
e68d51e
todo about mypy in simple_linear
rfl-urbaniak Jun 21, 2024
74d6be0
interactions done
rfl-urbaniak Jun 25, 2024
e3a2e89
cleaned up
rfl-urbaniak Jun 26, 2024
411989c
Andy's comments
rfl-urbaniak Jun 27, 2024
b559c04
dealing with mwc in progress
rfl-urbaniak Jun 27, 2024
9e604e8
predictive with mwc working
rfl-urbaniak Jun 27, 2024
26a9eb1
register input
rfl-urbaniak Jun 28, 2024
64788ab
categorical WIP
rfl-urbaniak Jun 28, 2024
5b957a1
change vis of interactions
rfl-urbaniak Jun 28, 2024
85b4b2e
cleanup
rfl-urbaniak Jun 28, 2024
a3f91a9
Merge pull request #125 from BasisResearch/ru-simplify-plate-behavior
rfl-urbaniak Jun 28, 2024
e709156
changed chirho to github source
rfl-urbaniak Jun 28, 2024
3e3fe1a
format, test change chirho ref in setup
rfl-urbaniak Jun 28, 2024
1c1801e
cleanup
rfl-urbaniak Jul 1, 2024
195b199
bump pyro to 1.8.6 as required by chirho
rfl-urbaniak Jul 1, 2024
a86ef65
suspend mypy lines in simple_linear
rfl-urbaniak Jul 1, 2024
1a8c082
typo in data processing
rfl-urbaniak Jul 1, 2024
d197272
started handler, WIP
rfl-urbaniak Jul 1, 2024
523902f
adding categorical contributions WIP
rfl-urbaniak Jul 2, 2024
f749ddc
adding continuous contributions to child
rfl-urbaniak Jul 2, 2024
6b93915
first test for AddCausalLayer
rfl-urbaniak Jul 2, 2024
1aaec63
format lint
rfl-urbaniak Jul 2, 2024
5c811bb
causal layers WIP
rfl-urbaniak Jul 3, 2024
ae85d5d
register input revisions in progress
rfl-urbaniak Jul 3, 2024
d4b4a94
format lint
rfl-urbaniak Jul 3, 2024
9edb259
viz
rfl-urbaniak Jul 5, 2024
c8f3989
units model WIP
rfl-urbaniak Jul 5, 2024
d83f559
simplified continuous contributions
rfl-urbaniak Jul 5, 2024
f1ba72e
wip
rfl-urbaniak Jul 5, 2024
65b7116
continuous predictors work
rfl-urbaniak Jul 6, 2024
e809983
categorical working
rfl-urbaniak Jul 6, 2024
53b681c
factored out adding linear component
rfl-urbaniak Jul 6, 2024
7164f06
full model structure
rfl-urbaniak Jul 7, 2024
5d749fc
evaluation for unit model
rfl-urbaniak Jul 7, 2024
4ea4d1d
zoning model with viz
rfl-urbaniak Jul 8, 2024
edd7731
zoning causal model notebook
rfl-urbaniak Jul 9, 2024
ad0f78a
dags and gitignore cleanup
rfl-urbaniak Jul 9, 2024
6f6c6f7
notebook revised
rfl-urbaniak Jul 9, 2024
7e1fedc
added smoke testing to notebook
rfl-urbaniak Jul 10, 2024
3875ff5
add smoke test to zoning_unit_model.ipynb
rfl-urbaniak Jul 10, 2024
eccbe4f
removed irrelevant files
rfl-urbaniak Jul 10, 2024
9b05113
format, lint
rfl-urbaniak Jul 10, 2024
b01e2f6
removed redundant notebooks
rfl-urbaniak Jul 10, 2024
51092da
clean notebooks
rfl-urbaniak Jul 10, 2024
1f1d909
removed irrelevant tests for simple linear
rfl-urbaniak Jul 10, 2024
27f6ce6
added distance to DistanceCausalModel
rfl-urbaniak Jul 11, 2024
735d49e
distance causal model
rfl-urbaniak Jul 11, 2024
6f2b3f3
zoning query WIP
rfl-urbaniak Jul 11, 2024
88cb07d
exporting predictor object
rfl-urbaniak Jul 12, 2024
f456dd9
preds for frontend, first pass
rfl-urbaniak Jul 12, 2024
ae84f56
added preds zip
rfl-urbaniak Jul 12, 2024
1eb253e
black notebooks
rfl-urbaniak Jul 12, 2024
bb4981f
minor
rfl-urbaniak Jul 18, 2024
63e1ebd
update refs to PredictiveModel
rfl-urbaniak Jul 18, 2024
0a79030
minimal missingness model
rfl-urbaniak Jul 18, 2024
61cc513
missingness only model performance on synthetic data tested
rfl-urbaniak Jul 18, 2024
07583aa
logistic component abstraction and missingness model performance on s…
rfl-urbaniak Jul 18, 2024
2640989
missingness only model performance on synthetic data tested
rfl-urbaniak Jul 18, 2024
3f14e45
Merge branch 'ru-missingness-model' of https://github.com/BasisResear…
rfl-urbaniak Jul 20, 2024
364a222
census tract mappings in gitignore
rfl-urbaniak Jul 20, 2024
7 changes: 7 additions & 0 deletions .gitignore
@@ -23,3 +23,10 @@ tests/.coverage
.vscode/launch.json
data/sql/counties_database.db
data/sql/msa_database.db
.Rproj.user
**/*.RData
**/*.Rhistory

# data
data/minneapolis/processed/values_long.csv
data/minneapolis/sourced/parcel_to_census_tract_mappings/**
240 changes: 240 additions & 0 deletions cities/modeling/evaluation.py
@@ -0,0 +1,240 @@
import os

import matplotlib.pyplot as plt
import pyro
import seaborn as sns
import torch
from pyro.infer import Predictive
from torch.utils.data import DataLoader, random_split

from cities.modeling.svi_inference import run_svi_inference
from cities.utils.data_grabber import find_repo_root
from cities.utils.data_loader import select_from_data

root = find_repo_root()


def prep_data_for_test(train_size=0.8):
zoning_data_path = os.path.join(
root, "data/minneapolis/processed/zoning_dataset.pt"
)
zoning_dataset_read = torch.load(zoning_data_path)

train_size = int(train_size * len(zoning_dataset_read))
test_size = len(zoning_dataset_read) - train_size

train_dataset, test_dataset = random_split(
zoning_dataset_read, [train_size, test_size]
)

train_loader = DataLoader(train_dataset, batch_size=train_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=test_size, shuffle=False)

categorical_levels = zoning_dataset_read.categorical_levels

return train_loader, test_loader, categorical_levels


def recode_categorical(kwarg_names, train_loader, test_loader):

assert all(
item in kwarg_names.keys() for item in ["categorical", "continuous", "outcome"]
)
assert kwarg_names["outcome"] not in kwarg_names["continuous"]

train_data = next(iter(train_loader))
test_data = next(iter(test_loader))

_train_data = select_from_data(train_data, kwarg_names)
_test_data = select_from_data(test_data, kwarg_names)

#####################################################
# eliminate test categories not in the training data
#####################################################
def apply_mask(data, mask):
return {key: val[mask] for key, val in data.items()}

mask = torch.ones(len(_test_data["outcome"]), dtype=torch.bool)
for key, value in _test_data["categorical"].items():
mask = mask * torch.isin(
_test_data["categorical"][key], (_train_data["categorical"][key].unique())
)

_test_data["categorical"] = apply_mask(_test_data["categorical"], mask)
_test_data["continuous"] = apply_mask(_test_data["continuous"], mask)
_test_data["outcome"] = _test_data["outcome"][mask]

for key in _test_data["categorical"].keys():
assert _test_data["categorical"][key].shape[0] == mask.sum()
for key in _test_data["continuous"].keys():
assert _test_data["continuous"][key].shape[0] == mask.sum()

    # raise an error if masking removed more than half of the original test data
    if mask.sum() < 0.5 * len(mask):
        raise ValueError(
            "Sampled test data has too many new categorical levels, consider decreasing train size"
        )

######################################
# recode categorical variables to have
# no index gaps in the training data
# ####################################

mappings = {}
for name in _train_data["categorical"].keys():
unique_train = torch.unique(_train_data["categorical"][name])
mappings[name] = {v.item(): i for i, v in enumerate(unique_train)}
_train_data["categorical"][name] = torch.tensor(
[mappings[name][x.item()] for x in _train_data["categorical"][name]]
)
_test_data["categorical"][name] = torch.tensor(
[mappings[name][x.item()] for x in _test_data["categorical"][name]]
)

# for key in _train_data['categorical'].keys():
# print(key)
# print(_train_data['categorical'][key].unique())
# print(_test_data['categorical'][key].unique())
    # TODO consider adding assertion
# assert torch.all(test_data['categorical'][key].unique() in _train_data['categorical'][key].unique())

return _train_data, _test_data


def test_performance(
model_or_class,
kwarg_names,
train_loader,
test_loader,
categorical_levels,
n_steps=600,
plot=True,
is_class=True,
):
_train_data, _test_data = recode_categorical(kwarg_names, train_loader, test_loader)

pyro.clear_param_store()
# TODO perhaps remove the original categorical levels here

######################
# train and test
######################

if is_class:
model = model_or_class(**_train_data)

else:
model = model_or_class

guide = run_svi_inference(
model, n_steps=n_steps, lr=0.01, verbose=True, **_train_data
)

predictive = Predictive(model, guide=guide, num_samples=1000)

categorical_levels = model.categorical_levels

samples_training = predictive(
categorical=_train_data["categorical"],
continuous=_train_data["continuous"],
outcome=None,
categorical_levels=categorical_levels,
)

samples_test = predictive(
categorical=_test_data["categorical"],
continuous=_test_data["continuous"],
outcome=None,
categorical_levels=categorical_levels,
)

train_predicted_mean = (
samples_training[kwarg_names["outcome"]].squeeze().mean(dim=0)
)
train_predicted_lower = (
samples_training[kwarg_names["outcome"]].squeeze().quantile(0.05, dim=0)
)
train_predicted_upper = (
samples_training[kwarg_names["outcome"]].squeeze().quantile(0.95, dim=0)
)

coverage_training = (
_train_data["outcome"].squeeze().gt(train_predicted_lower).float()
* _train_data["outcome"].squeeze().lt(train_predicted_upper).float()
)
residuals_train = _train_data["outcome"].squeeze() - train_predicted_mean
mae_train = torch.abs(residuals_train).mean().item()

rsquared_train = 1 - residuals_train.var() / _train_data["outcome"].squeeze().var()

test_predicted_mean = samples_test[kwarg_names["outcome"]].squeeze().mean(dim=0)
test_predicted_lower = (
samples_test[kwarg_names["outcome"]].squeeze().quantile(0.05, dim=0)
)
test_predicted_upper = (
samples_test[kwarg_names["outcome"]].squeeze().quantile(0.95, dim=0)
)

coverage_test = (
_test_data["outcome"].squeeze().gt(test_predicted_lower).float()
* _test_data["outcome"].squeeze().lt(test_predicted_upper).float()
)
residuals_test = _test_data["outcome"].squeeze() - test_predicted_mean
mae_test = torch.abs(residuals_test).mean().item()

rsquared_test = 1 - residuals_test.var() / _test_data["outcome"].squeeze().var()

if plot:
fig, axs = plt.subplots(2, 2, figsize=(14, 10))

axs[0, 0].scatter(
x=_train_data["outcome"], y=train_predicted_mean, s=6, alpha=0.5
)
axs[0, 0].set_title(
"Training data, ratio of outcomes within 95% CI: {:.2f}".format(
coverage_training.mean().item()
)
)
axs[0, 0].set_xlabel("true outcome")
axs[0, 0].set_ylabel("mean predicted outcome")

axs[0, 1].hist(residuals_train, bins=50)
axs[0, 1].set_title(
"Training set residuals, Rsquared: {:.2f}".format(rsquared_train.item())
)
axs[0, 1].set_xlabel("residuals")
axs[0, 1].set_ylabel("frequency")

axs[1, 0].scatter(
x=_test_data["outcome"], y=test_predicted_mean, s=6, alpha=0.5
)
axs[1, 0].set_title(
"Test data, ratio of outcomes within 95% CI: {:.2f}".format(
coverage_test.mean().item()
)
)
axs[1, 0].set_xlabel("true outcome")
axs[1, 0].set_ylabel("mean predicted outcome")

axs[1, 1].hist(residuals_test, bins=50)
axs[1, 1].set_title(
"Test set residuals, Rsquared: {:.2f}".format(rsquared_test.item())
)
axs[1, 1].set_xlabel("residuals")
axs[1, 1].set_ylabel("frequency")

plt.tight_layout(rect=[0, 0, 1, 0.96])
sns.despine()

fig.suptitle("Model evaluation", fontsize=16)

plt.show()

return {
"mae_train": mae_train,
"mae_test": mae_test,
"rsquared_train": rsquared_train,
"rsquared_test": rsquared_test,
"coverage_train": coverage_training.mean().item(),
"coverage_test": coverage_test.mean().item(),
}
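
A minimal usage sketch for the new evaluation helpers follows. The column names in `kwarg_names` and the class bound to `model_cls` are hypothetical placeholders; any model class whose constructor accepts `categorical`, `continuous`, and `outcome` keyword arguments (and exposes a `categorical_levels` attribute, as `test_performance` assumes) should fit this calling convention.

# Hypothetical usage sketch: the column names in kwarg_names and the model
# class bound to model_cls are placeholders, not identifiers defined in this PR.
from cities.modeling.evaluation import prep_data_for_test, test_performance

kwarg_names = {
    "categorical": ["census_tract", "zoning_category"],  # assumed categorical predictors
    "continuous": ["parcel_area", "distance_to_transit"],  # assumed continuous predictors
    "outcome": "housing_units",  # assumed outcome variable
}

train_loader, test_loader, categorical_levels = prep_data_for_test(train_size=0.8)

model_cls = ...  # one of the model classes added in this PR, left abstract here

metrics = test_performance(
    model_cls,  # a class, so the default is_class=True constructs it from the train data
    kwarg_names,
    train_loader,
    test_loader,
    categorical_levels,
    n_steps=600,
    plot=False,
)
print(metrics["mae_test"], metrics["rsquared_test"], metrics["coverage_test"])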
2 changes: 1 addition & 1 deletion cities/modeling/model_interactions.py
@@ -3,10 +3,10 @@
from typing import Optional

import dill
import pyro
import pyro.distributions as dist
import torch

import pyro
from cities.modeling.modeling_utils import (
prep_wide_data_for_inference,
train_interactions_model,
2 changes: 1 addition & 1 deletion cities/modeling/modeling_utils.py
@@ -2,13 +2,13 @@

import matplotlib.pyplot as plt
import pandas as pd
import pyro
import torch
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam # type: ignore
from scipy.stats import spearmanr

import pyro
from cities.utils.data_grabber import (
DataGrabber,
list_available_features,
41 changes: 41 additions & 0 deletions cities/modeling/svi_inference.py
@@ -0,0 +1,41 @@
import matplotlib.pyplot as plt
import pyro
import torch
from pyro.infer.autoguide import AutoMultivariateNormal, init_to_mean


def run_svi_inference(
model,
verbose=True,
lr=0.03,
vi_family=AutoMultivariateNormal,
guide=None,
    hide=None,
    n_steps=500,
    ylim=None,
**model_kwargs
):
losses = []
if guide is None:
guide = vi_family(pyro.poutine.block(model, hide=hide), init_loc_fn=init_to_mean)
elbo = pyro.infer.Trace_ELBO()(model, guide)

elbo(**model_kwargs)
adam = torch.optim.Adam(elbo.parameters(), lr=lr)

for step in range(1, n_steps + 1):
adam.zero_grad()
loss = elbo(**model_kwargs)
loss.backward()
losses.append(loss.item())
adam.step()
        if verbose and ((step % 50 == 0) or (step == 1)):
            print("[iteration %04d] loss: %.4f" % (step, loss))

plt.plot(losses)
if ylim:
plt.ylim(ylim)
plt.show()


return guide
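
For reference, a small self-contained sketch of how `run_svi_inference` can be called; the toy regression model and synthetic data below are illustrative only and not part of this PR.

# Illustrative only: a toy Pyro model and synthetic data used to show the
# calling convention of run_svi_inference; nothing below is defined in this PR.
import pyro
import pyro.distributions as dist
import torch

from cities.modeling.svi_inference import run_svi_inference


def toy_model(x, y=None):
    slope = pyro.sample("slope", dist.Normal(0.0, 1.0))
    intercept = pyro.sample("intercept", dist.Normal(0.0, 1.0))
    sigma = pyro.sample("sigma", dist.Exponential(1.0))
    with pyro.plate("data", x.shape[0]):
        return pyro.sample("y", dist.Normal(intercept + slope * x, sigma), obs=y)


x = torch.linspace(0.0, 1.0, 100)
y = 0.5 + 2.0 * x + 0.1 * torch.randn(100)

pyro.clear_param_store()
# model kwargs (x, y) are forwarded to the model via **model_kwargs
guide = run_svi_inference(toy_model, n_steps=500, lr=0.03, verbose=True, x=x, y=y)
# the returned guide can then be passed to pyro.infer.Predictive, as in evaluation.py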