diff --git a/docs/guides/zoning-tracts-model.ipynb b/docs/guides/zoning-tracts-model.ipynb index d3b33dd3..4c7e5849 100644 --- a/docs/guides/zoning-tracts-model.ipynb +++ b/docs/guides/zoning-tracts-model.ipynb @@ -4,173 +4,162 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Outline" + "## Background and motivations\n", + "\n", + "\n", + "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Background and motivations\n", + "In urban planning, understanding the factors that drive housing development is crucial for sustainable growth. Cities often use zoning regulations to influence where and how much housing is built. However, these effects are not always straightforward to predict, as they involve a complex interplay of economic, demographic, and spatial factors. \n", "\n", + "In this analysis, we are working to identify the best predictors of housing unit availability. The goal is to understand which factors or regulatory variables most effectively increase the number of housing units. By focusing on the city of Minneapolis, the dataset gives us insights into various social, economic, and spatial characteristics across census tracts, enabling us to examine potential impacts on housing supply.\n", "\n", - "\n" + "To capture the causal relationships between these variables and housing availability, we’re leveraging the Pyro library, a tool specifically suited for probabilistic programming and causal inference. Pyro offers a powerful framework for causal modeling, especially beneficial in scenarios where the underlying relationships between variables are complex and uncertain.
You can read more about Pyro here: https://pyro.ai/\n" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "#check out zoning_new_data_pipeline\n", - "use viz" + "### The dataset" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "\n", - "#this comes from test_tracts_model.py\n", - "\n", - "data_path = os.path.join(root, \"data/minneapolis/processed/pg_census_tracts_dataset.pt\")\n", - "\n", - "dataset_read = torch.load(data_path, weights_only=False)\n", - "\n", - "loader = DataLoader(dataset_read, batch_size=len(dataset_read), shuffle=True)\n", - "\n", - "data = next(iter(loader))\n", - "\n", - "\n", - "kwargs = {\n", - " \"categorical\": [\"year\", \"census_tract\"],\n", - " \"continuous\": {\n", - " \"housing_units\",\n", - " \"total_value\",\n", - " \"median_value\",\n", - " \"mean_limit_original\",\n", - " \"median_distance\",\n", - " \"income\",\n", - " \"segregation_original\",\n", - " \"white_original\",\n", - " \"parcel_mean_sqm\",\n", - " \"parcel_median_sqm\",\n", - " \"parcel_sqm\",\n", - " \"downtown_overlap\",\n", - " \"university_overlap\",\n", - " },\n", - " \"outcome\": \"housing_units\",\n", - "}\n", - "\n", - "\n", - "pg_subset = select_from_data(data, kwargs)\n", - "pg_dataset_read = torch.load(data_path, weights_only=False)\n" + "**Source and details here**\n", + "The dataset was obtained..."
] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "## tytulem wstepu, dane z permitow, tak jak w zoning data, babelki i media" + "The Pearson correlation plot below represents the relationships between the variables in the dataset:" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "## Zmienne" + "![](../experimental_notebooks/zoning/corr_plot.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Causal modeling in general\n", - "\n", - "tracts model overview, read, possibly update, \n", - "the graphics is outdated, generate new one using ....dags.R\n", - "looking at the rendering from zoning_tracts_continuous_interactions.ipynb\n", - "\n", - "defined in \n", + "In the `zoning_tracts_data.ipynb` notebook, the data is structured into a format suitable for modeling by aggregating the number of housing units across Minneapolis by year and census tract ID. The total number of housing units within each census tract is calculated by summing the individual units, ensuring no loss of data due to the non-overlapping nature of the parcels. Additionally, two key metrics are derived: `median_value` and `summed_value`, which represent the respective values of housing units within each census tract.
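The aggregation described above can be sketched with pandas. This is a toy illustration only: the actual pipeline lives in `zoning_tracts_data.ipynb`, and the rows and column names here are invented.

```python
import pandas as pd

# Toy parcel-level table standing in for the Minneapolis parcel data.
parcels = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020],
    "census_tract": ["27053000100", "27053000100", "27053000200", "27053000100"],
    "housing_units": [10, 4, 7, 12],
    "value": [300_000, 120_000, 210_000, 360_000],
})

# One row per (year, census_tract): summed units plus the two derived
# value metrics described in the text.
tracts = (
    parcels.groupby(["year", "census_tract"], as_index=False)
    .agg(
        housing_units=("housing_units", "sum"),
        median_value=("value", "median"),
        summed_value=("value", "sum"),
    )
)
print(tracts)
```

Because parcels do not overlap, summing within a tract neither drops nor double-counts any units.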
This aggregated structure, one row per census tract and year, is the input to the model described below.\n", "\n", - "zoning_tracts_continuous_interactions_model.py\n", - "\n", - "btw update tracts_model_overiew with new graphics\n" + "The model uses the following variables:" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "## Construction and evaluation\n", + "- **Categorical Variables**: `year`, `census_tract`\n", "\n", - "- directions of causal assumptions are rather natural, were happy for the user to modify and iterate\n", - "- in adding variables we were frugal, at each step evaluating the model in terms of train-test split\n", "and WAIC \n", "\n", + "- **Continuous Variables**: `housing_units`, `total_value`, `median_value`, `mean_limit_original`, `median_distance`, `income`, `segregation_original`, `white_original`, `parcel_mean_sqm`, `parcel_median_sqm`, `parcel_sqm`, `downtown_overlap`, `university_overlap`\n", "\n", - "-explain interactions as essentially adding another continuous predictor\n", - "- explain waic briefly as well\n", - "\n", - "example of performance results, also with the original scale\n" + "- **Outcome Variable**: `housing_units`" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "## Outliers\n", - "\n", - "\n", - "\n", - "messy environment, high granularity, hard to predict some extreme events\n", - "in particular, the reform does not touch university and downtown, which had their own regulation\n", - "especially downtown underwent modifications not captured by the data, \n", - "\n", - "graph residuals for regions\n", - "\n", - "statsy outlierow " + "### Causal Modeling \n", + "\n", + "The `TractsModelContinuousInteractions` class constructs a causal model that includes continuous interaction terms, where variables like `income`, `limit`, and `distance` interact to predict outcomes.
This is achieved using the following modeling approach:\n", + "\n", + "1. **Model Structure**: \n", + " - The model is structured hierarchically, where each dependent variable (like `housing_units`, `income`, and `segregation`) is regressed on a combination of categorical and continuous predictors.\n", + " - For example, the relationship for housing units can be expressed mathematically as:\n", + " $$\n", + " \\text{housing\\_units} = f(\\text{distance}, \\text{income}, \\text{white}, \\text{segregation}, \\text{sqm}, \\text{...}) + \\epsilon\n", + " $$\n", + " - Here, $ f $ represents the functional form (e.g., linear regression), and $ \\epsilon $ is the error term, capturing unobserved influences.\n", + "\n", + "2. **Sampling**: \n", + " - The model uses probabilistic inference methods to sample from the posterior distribution of the model parameters using Pyro’s `pyro.sample` function. For instance, categorical variables are sampled using a categorical distribution, while continuous variables are sampled from a normal distribution:\n", + " $$\n", + " y \\sim \\text{Categorical}(\\pi), \\quad x \\sim \\mathcal{N}(\\mu, \\sigma)\n", + " $$\n", + "\n", + "3. **Components of the Model**: \n", + " - The model integrates multiple components that represent specific relationships. For example:\n", + " - **Linear Components**: Models like `add_linear_component` estimate relationships based on linear regression equations.\n", + " - **Ratio Components**: Models like `add_ratio_component` handle variables that are ratios, such as `segregation` and `income`, allowing for multiplicative interactions between predictors." ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "## Interventions\n", "\n", - "ogolnie co to jest interwencja w tym kontekscie\n", + "#### Continuous Interactions\n", + "Continuous interactions are treated as additional predictors in the model. 
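The "interaction as just another predictor" idea can be sketched outside Pyro with plain least squares. This is an illustrative NumPy stand-in for the linear components described above, not the project's `add_linear_component`; all data and coefficients here are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Invented, standardized stand-ins for two of the model's predictors.
limit = rng.normal(size=n)
income = rng.normal(size=n)

# Synthetic outcome with a genuine limit-by-income interaction effect.
housing_units = (1.0 + 0.5 * limit - 0.3 * income
                 + 0.8 * limit * income
                 + rng.normal(scale=0.1, size=n))

# The interaction enters the design matrix as just one more column.
X = np.column_stack([np.ones(n), limit, income, limit * income])
coef, *_ = np.linalg.lstsq(X, housing_units, rcond=None)
print(coef.round(2))  # ≈ [1.0, 0.5, -0.3, 0.8]
```

Once the elementwise product `limit * income` is built, the fitting machinery treats it exactly like any other continuous predictor.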
For instance, if `limit` interacts with `income`, this is captured by adding terms like `limit * income` in the regression structure, allowing us to explore how multiple factors together influence housing outcomes.\n", "\n", - "### Brute force example\n", + "### WAIC and Model Evaluation\n", + "**Watanabe-Akaike Information Criterion (WAIC)** is employed to evaluate and compare different model configurations. It is particularly suited to Bayesian models, as it considers the entire posterior distribution rather than just point estimates. WAIC works by calculating the **log pointwise predictive density (lppd)**, averaged across posterior samples, to capture how well the model predicts the observed data, while also penalizing for model complexity to avoid overfitting.\n", "\n", - "wszedzie zero wszedzie 1, porownanie\n", + "$$\n", + "\\text{WAIC} = -2 \\cdot (\\text{lppd} - \\text{penalty})\n", + "$$\n", + "where:\n", + " - **lppd** sums the log-likelihood of each data point, averaged across posterior samples.\n", + " - **penalty** accounts for model complexity by considering the variance of log-likelihood values across posterior samples.\n", "\n", - "### In line with the reform\n", + "Lower WAIC values indicate a model that better balances predictive accuracy and complexity. By comparing WAIC values across models, we can determine which model best captures the data's patterns while avoiding overfitting. If one model has a significantly lower WAIC, it is considered to offer better predictive performance. Thus, WAIC guides the model development process, helping select a model that provides an optimal fit without unnecessary complexity." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Outliers\n", "\n", - "predict.py contains sql\n", + "Urban zoning and housing are subject to many unpredictable factors, resulting in high variability and outliers. 
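One way to make this concrete (with invented numbers, not the model's actual residuals) is to flag tracts whose prediction errors are large relative to the posterior predictive spread:

```python
import numpy as np

# Invented example: observed vs. model-predicted housing units for five
# tracts, with predictive standard deviations from the posterior.
observed  = np.array([120.0,  80.0,  95.0, 2100.0, 110.0])
predicted = np.array([115.0,  84.0, 101.0,  400.0, 108.0])
pred_sd   = np.array([ 20.0,  18.0,  22.0,  150.0,  19.0])

# Standardized residuals: how many predictive SDs each tract is off by.
z = (observed - predicted) / pred_sd

# Flag tracts with extreme residuals (|z| > 3 is a common cutoff).
outliers = np.flatnonzero(np.abs(z) > 3.0)
print(outliers)
```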
For instance, zones like university areas and downtown are governed by unique regulations, which introduces unexpected patterns in housing data.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Interventions\n", "\n", - "zoning_tracts_intervention_testing.inpyb\n", + "Interventions in causal models allow us to simulate hypothetical policy changes. By directly manipulating a variable, such as setting zoning `limit` values to different levels, we can analyze potential impacts on housing units.\n", "\n", - "zawiera kilka najgorszych outlierow (ktore jako przyklady bez interwencji wczesniej mozna podac)\n", + "We begin with brute-force interventions by setting variables to extreme values (e.g., all zeros or all ones). This provides a general idea of the effect range based on zoning limits alone.\n", "\n", - "I wyjasnic roznice miedzy observed, factual, counterfactual" + "Using the `do` operator in Pyro, we simulate a realistic intervention scenario by adjusting `limit` values in line with Minneapolis reforms. This approach helps compare observed values with both factual and counterfactual predictions, examining how Minneapolis’ zoning changes might have impacted housing availability.\n" ] } ], "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" } }, "nbformat": 4,
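The observed/factual/counterfactual comparison described in the Interventions section can be illustrated with a dependency-free toy. The structural equations and every coefficient below are invented for illustration; in the actual analysis the intervention is performed with Pyro's `do` effect handler (`pyro.poutine.do`) on the fitted tracts model.

```python
import numpy as np

def simulate(limit, n=1000, seed=0):
    """Toy structural model of housing units as a function of limit,
    income, and distance. All equations and coefficients are made up."""
    rng = np.random.default_rng(seed)
    income = rng.normal(1.0, 0.2, size=n)
    distance = rng.normal(2.0, 0.5, size=n)
    noise = rng.normal(0.0, 0.05, size=n)
    # Fixing `limit` from outside, regardless of what usually causes it,
    # is exactly the intervention do(limit = ...).
    housing_units = (0.4 * income - 0.1 * distance
                     + 0.6 * limit + 0.3 * limit * income + noise)
    return housing_units.mean()

factual = simulate(limit=0.2)         # limits roughly at observed levels
counterfactual = simulate(limit=1.0)  # do(limit = 1.0), a reform-style change
effect = counterfactual - factual
```

Because both runs share the same seed, the exogenous noise cancels and the difference isolates the effect of the intervention, which is the same logic applied when comparing factual and counterfactual predictions from the tracts model.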